Foreword


Lotfi Zadeh 

Among my many Ph.D. students, some have forged new tools in their work.
Jyh-Shing Roger Jang and Chuen-Tsai Sun fall into this category. Neuro-Fuzzy and Soft
Computing makes visible their mastery of the subject matter, their insightfulness, 
and their expository skill. Their coauthor, Eiji Mizutani, has made an important 
contribution by bringing to the writing of the text his extensive experience in dealing 
with real-world problems in an industrial setting. 

Neuro-Fuzzy and Soft Computing is one of the first texts to focus on soft com- 
puting — a concept which has direct bearing on machine intelligence. In this 
connection, a bit of history is in order. 

The concept of soft computing began to crystallize during the past several years 
and is rooted in some of my earlier work on soft data analysis, fuzzy logic, and 
intelligent systems. Today, close to four decades after artificial intelligence (AI) 
was born, it can finally be said with some justification that intelligent systems are 
becoming a reality. Why did it take so long for the era of intelligent systems to 
arrive? 

In the first place, the AI community had greatly underestimated the difficulty 
of attaining the ambitious goals which were on its agenda. The needed technolo- 
gies were not in place and the conceptual tools in AI’s armamentarium — mainly 
predicate logic and symbol manipulation techniques — were not the right tools for 
building machines which could be called intelligent in a sense that matters in real 
world applications. 

Today we have the requisite hardware, software, and sensor technologies at our 
disposal for building intelligent systems. But, perhaps more important, we are also 
in possession of computational tools which are far more effective in the conception 
and design of intelligent systems than the predicate-logic-based methods, which 
form the core of traditional AI. The tools in question derive from a collection of 
methodologies which fall under the rubric of what has come to be known as soft 
computing (SC). In large measure, the employment of soft computing techniques 
underlies the rapid growth in the variety and visibility of consumer products and 
industrial systems which qualify to be assessed as possessing a significantly high 
MIQ (machine intelligence quotient).

The essence of soft computing is that unlike the traditional, hard computing, 
soft computing is aimed at an accommodation with the pervasive imprecision of the 
real world. Thus, the guiding principle of soft computing is to exploit the tolerance 
for imprecision, uncertainty, and partial truth to achieve tractability, robustness, 
low solution cost, and better rapport with reality. In the final analysis, the role 
model for soft computing is the human mind. 

Soft computing is not a single methodology. Rather, it is a partnership. The 
principal partners at this juncture are fuzzy logic (FL), neurocomputing (NC), and 
probabilistic reasoning (PR), with the latter subsuming genetic algorithms (GA), 
chaotic systems, belief networks, and parts of learning theory. The pivotal contri- 
bution of FL is a methodology for computing with words; that of NC is system 
identification, learning, and adaptation; that of PR is propagation of belief; and 
that of GA is systematized random search and optimization. 

In the main, FL, NC, and PR are complementary rather than competitive. For 
this reason, it is frequently advantageous to use FL, NC, and PR in combination 
rather than exclusively, leading to so-called hybrid intelligent systems. At this 
juncture, the most visible systems of this type are neuro-fuzzy systems. We are 
also beginning to see fuzzy-genetic, neuro-genetic, and neuro-fuzzy-genetic systems. 
Such systems are likely to become ubiquitous in the not distant future. 

In coming years, the ubiquity of intelligent systems is certain to have a pro- 
found impact on the ways in which human-made systems are conceived, designed, 
manufactured, employed, and interacted with. This is the perspective in which the 
contents of Neuro-Fuzzy and Soft Computing should be viewed. 

Taking a closer look at the contents of Neuro-Fuzzy and Soft Computing, what 
should be noted is that today most of the applications of fuzzy logic involve what 
might be called the calculus of fuzzy rules, or CFR for short. To a considerable de- 
gree, CFR is self-contained. Furthermore, CFR is relatively easy to master because 
it is close to human intuition. Taking advantage of this, the authors focus their 
attention on CFR and minimize the time and effort needed to acquire sufficient 
expertise in fuzzy logic to apply it to real-world problems. 

One of the central issues in CFR is the induction of rules from observations. In 
this context, neural network techniques and genetic algorithms play pivotal roles, 
which are discussed in Neuro-Fuzzy and Soft Computing in considerable detail
and with a great deal of insight. In the application of neural network techniques, 
the main tool is that of gradient programming. By contrast, in the application of 
genetic algorithms, simulated annealing, and random search methods, the existence 
of a gradient is not assumed. The complementarity of gradient programming and 
gradient-free methods provides a basis for the conception and design of neuro-genetic 
systems. 

A notable contribution of Neuro-Fuzzy and Soft Computing is the exposition
of ANFIS (Adaptive Neuro-Fuzzy Inference System), a system developed by the
authors which is finding numerous applications in a variety of fields. ANFIS and 
its variants and relatives in the realms of neural, neuro-fuzzy, and reinforcement 
learning systems represent a direction of basic importance in the conception and
design of intelligent systems with high MIQ. 

Neuro-Fuzzy and Soft Computing is a thoroughly up-to-date text with a wealth 
of information which is well-organized, clearly presented, and illustrated by many 
examples. It is required reading for anyone who is interested in acquiring a solid 
background in soft computing — a partnership of methodologies which play pivotal 
roles in the conception, design, and application of intelligent systems. 


Lotfi A. Zadeh 




Preface 


During the past few years, we have witnessed a rapid growth in the number and 
variety of applications of fuzzy logic and neural networks, ranging from consumer 
electronics and industrial process control to decision support systems and financial 
trading. Neuro-fuzzy modeling, together with a new driving force from stochastic, 
gradient-free optimization techniques such as genetic algorithms and simulated an- 
nealing, forms the constituents of so-called soft computing, which is aimed at solving 
real-world decision-making, modeling, and control problems. These problems are
usually imprecisely defined and require human intervention. Thus, neuro-fuzzy and 
soft computing, with their ability to incorporate human knowledge and to adapt 
their knowledge base via new optimization techniques, are likely to play increasingly
important roles in the conception and design of hybrid intelligent systems. 

This book provides the first comprehensive treatment of the constituent method- 
ologies underlying neuro-fuzzy and soft computing, an evolving branch within the 
scope of computational intelligence that is drawing increasingly more attention as 
it develops. Its main features include fuzzy set theory, neural networks, data clus- 
tering techniques, and several stochastic optimization methods that do not require 
gradient information. In particular, we place equal emphasis on the theoretical aspects of
the covered methodologies and on empirical observations and verifications of various
applications in practice.


AUDIENCE 

This book is intended for use as a text in courses on computational intelligence at 
either the senior or first-year graduate level. It is also suitable for use as a self-study 
guide by students and researchers who want to learn basic and advanced neuro-fuzzy 
and soft computing within the framework of computational intelligence.
Prerequisites are minimal; the reader is expected to have basic knowledge of elementary
calculus and linear algebra. 
ORGANIZATION

Chapter 1 gives an overview of neuro-fuzzy and soft computing. Brief historical 
traces of relevant techniques are described to direct our first step toward the
neuro-fuzzy and soft computing world. The remainder of the book is organized into the
seven parts described next.

Part I (Chapters 2 through 4) presents a detailed introduction to the theory and 
terminology of fuzzy disciplines, including fuzzy sets, fuzzy rules, fuzzy reasoning, 
and fuzzy inference systems. 

Part II (Chapters 5 through 7) provides an overview of system identification and 
optimization techniques that prove to be effective for neuro-fuzzy and soft
computing. Chapter 5 introduces least-squares methods in the context of system
identification. Chapter 6 describes derivative-based nonlinear optimization techniques,
including nonlinear least-squares methods. These two chapters are inevitably
mathematically oriented, and for the first reading, many sections labeled with an asterisk
(*) can be omitted. Chapter 7 discusses derivative-free optimization techniques,
including genetic algorithms, simulated annealing, the downhill Simplex method, 
and random search. 

Part III (Chapters 8 through 11) introduces a variety of important neural net- 
work paradigms found in the literature, including adaptive networks as the most 
generalized framework for model construction, supervised learning neural networks 
for data regression and classification, reinforcement learning for infrequent and de- 
layed evaluative signals, unsupervised learning neural networks for data clustering, 
and some other networks that do not belong to any of the aforementioned categories. 

Part IV (Chapters 12 and 13) explains how to build ANFIS (Adaptive Neuro- 
Fuzzy Inference Systems) and CANFIS (Coactive Neuro-Fuzzy Inference Systems) 
as core neuro-fuzzy models that can incorporate human expertise as well as adapt 
themselves through repeated training. 

Part V (Chapters 14 through 16) covers structure identification techniques for 
neural networks and fuzzy modeling, including the CART (Classification and
Regression Tree) method, which is quite popular in multivariate statistical analysis;
several data clustering algorithms aimed at batch-mode model building; and efficient
rulebase formulation and organization via tree partitioning of the input space.

Part VI (Chapters 17 and 18) considers various approaches to the design of
neuro-fuzzy controllers, including expert control, inverse learning, specialized learning,
backpropagation through time, real-time recurrent learning, reinforcement learning, 
genetic algorithms, gain scheduling, and feedback linearization (in conjunction with 
sliding mode control). 

The last part, Part VII (Chapters 19 through 22), gives a variety of application 
examples in different domains, such as printed character recognition, inverse kine- 
matics problems in robotics, adaptive channel equalization, multivariate nonlinear 
regression, adaptive noise cancellation, nonlinear system identification, plasma spec- 
trum analysis, hand-written numeral recognition, game playing, and color recipe 
prediction. 
Figure 0.1. Prerequisite dependencies among chapters of this book.


The prerequisite dependencies among the individual chapters are shown in Fig- 
ure 0.1. This diagram arranges chapters with respect to the level of advancement, 
so the reader has some flexibility in studying the whole book. Sections marked with 
an asterisk (*) can be skipped for the first reading. 
FEATURES

The orientation of the book is toward methodologies that are likely to be of prac- 
tical use; many step-by-step examples are included to complement explanations in 
the text. Since one picture is worth thousands of words, this book contains spe- 
cially designed figures to visualize as many ideas and concepts as possible, and thus 
help readers understand them at a glance. Most scientific plots in the examples 
were generated by MATLAB® and SIMULINK®.¹ For the reader’s convenience,
these MATLAB programs can be obtained by filling out the reply card in this book, 
or via FTP or WWW. See the next section for details of how to obtain these 
MATLAB programs. Some of the examples and demonstrations require the Fuzzy 
Logic Toolbox™ by The MathWorks, Inc.; the contact information is

The MathWorks, Inc. 

24 Prime Park Way 
Natick, MA 01760-1500, USA 
Phone: (508) 647-7000 

Fax: (508) 647-7001 

E-mail: info@mathworks.com

WWW: http://www.mathworks.com

Chapters 2 through 18 are each followed by a set of exercises; some of them 
involve MATLAB programming tasks, which can be expanded into suitable term 
projects. This serves to confirm and reinforce understanding of the material pre- 
sented in each chapter, as well as to equip the reader with hands-on programming 
experiences for practical problem solving. Hints to selected exercises are in the 
appendix at the end of this book. 

For instructors who use this book as a text, the solution manual is available 
from the publisher. A set of viewgraphs that contains important illustrations in the 
book is available for classroom use. These viewgraphs are directly accessible via 
the book’s home page at 

http://www.cs.nthu.edu.tw/~jang/soft.htm
This is the place where the reader can do other things such as: 

• Get the most updated information (such as addendum, erratum, etc.) about 
the book. 

• Get enhancements and bug-fixes of MATLAB programs. 

• Give comments and suggestions. 

• View the statistics of comments and suggestions of other readers. 

• Link to the authors’ WWW home pages and other Internet resources. 


¹ MATLAB and SIMULINK are registered trademarks of The MathWorks, Inc.
A reference list is given at the end of each chapter that contains references to
the research literature. This enables readers to pursue individual topics in greater 
depth. Moreover, neuro-fuzzy and soft computing is a relatively new field and con- 
tinues to evolve rapidly within the scope of computational and artificial intelligence. 
These references provide an entry point from the well-defined core knowledge con- 
tained in this book to another dimension of innovative and challenging research and 
applications. 

OBTAINING THE EXAMPLE PROGRAMS 

For people without Internet access, the easiest way to get the free MATLAB
programs used in this book is to fill out the reply card (bound in the book) and send it
back to The MathWorks. The MathWorks will send you a floppy disk containing the
MATLAB files, free of charge. (Some of the MATLAB programs rely on the Fuzzy 
Logic Toolbox, which is not freely available.) 

For people with Internet access, the example MATLAB programs are available 
electronically in two ways: by FTP (File Transfer Protocol) or WWW (World Wide
Web). The FTP address is


ftp.mathworks.com


The files are at 


/pub/books/jang/*

For WWW access to the FTP site, the URL (universal resource locator) address is 
ftp://ftp.mathworks.com/pub/books/jang/

You can also access it via the book’s home page at 

http://www.cs.nthu.edu.tw/~jang/soft.htm
A sample FTP session is shown next, with what you should type in boldface. 

unix> ftp ftp.mathworks.com
Connected to ftp.mathworks.com.
220 ftp FTP server (Version wu-2.4(2) Tue Aug 1 10:31:36 EDT 1995) ready.
Name (ftp.mathworks.com:slin): anonymous
331 Guest login ok, send your complete e-mail address as password.
Password: slin@lsil.com (Use your email address here.)
230 Guest login ok, access restrictions apply.
ftp> cd /pub/books/jang
250 CWD command successful.
ftp> binary (You must specify binary transfer for some data files.)
200 Type set to I.
ftp> prompt (So you don’t need to transfer each file interactively.)
Interactive mode off.
ftp> mget * (Get every file in the directory.)
200 PORT command successful.
150 Opening BINARY mode data connection for addvec.m (691 bytes).
226 Transfer complete.
local: addvec.m remote: addvec.m
691 bytes received in 0.011 seconds (62 Kbytes/s)
200 PORT command successful.
ftp> bye
221 Goodbye.

You may want to use a mirror site near you: 
ftp://unix.hensa.ac.uk/mirrors/matlab
ftp://ftp.ask.uni-karlsruhe.de/pub/matlab
ftp://novell.felk.cvut.cz/pub/mirrors/mathwork
ftp://ftp.u-aizu.ac.jp:/pub/vendor/mathworks
Chapter 1


Introduction to 
Neuro-Fuzzy and 
Soft Computing 


1.1 INTRODUCTION 

Soft computing (SC), an innovative approach to constructing computationally in- 
telligent systems, has just come into the limelight. It is now realized that complex 
real-world problems require intelligent systems that combine knowledge, techniques, 
and methodologies from various sources. These intelligent systems are supposed to 
possess humanlike expertise within a specific domain, adapt themselves and learn to 
do better in changing environments, and explain how they make decisions or take ac- 
tions. In confronting real-world computing problems, it is frequently advantageous 
to use several computing techniques synergistically rather than exclusively, result- 
ing in construction of complementary hybrid intelligent systems. The quintessence 
of designing intelligent systems of this kind is neuro-fuzzy computing: neural 
networks that recognize patterns and adapt themselves to cope with changing envi- 
ronments; fuzzy inference systems that incorporate human knowledge and perform 
inferencing and decision making. The integration of these two complementary ap- 
proaches, together with certain derivative-free optimization techniques, results in a 
novel discipline called neuro-fuzzy and soft computing. 

As a prelude, we shall provide a bird’s-eye view of relevant intelligent system 
approaches, along with bits of their history, and discuss the features of neuro-fuzzy 
and soft computing. 

1.2 SOFT COMPUTING CONSTITUENTS AND CONVENTIONAL 
ARTIFICIAL INTELLIGENCE 


Soft computing is an emerging approach to computing which parallels 
the remarkable ability of the human mind to reason and learn in an 
environment of uncertainty and imprecision. (Lotfi A. Zadeh, 1992 [12]) 
Figure 1.1. A neural character recognizer and a knowledge base cooperate in
responding to three hand-written characters that form the word “dog.”


Table 1.1. Soft computing constituents (the first three items) and conventional 
artificial intelligence. 


Methodology                                 Strength
------------------------------------------  ------------------------------------------------
Neural network                              Learning and adaptation
Fuzzy set theory                            Knowledge representation via fuzzy if-then rules
Genetic algorithm and simulated annealing   Systematic random search
Conventional AI                             Symbolic manipulation


Soft computing consists of several computing paradigms, including neural net- 
works, fuzzy set theory, approximate reasoning, and derivative-free optimization 
methods such as genetic algorithms and simulated annealing. Each of these con- 
stituent methodologies has its own strength, as summarized in Table 1.1. The 
seamless integration of these methodologies forms the core of soft computing; the 
synergism allows soft computing to incorporate human knowledge effectively, deal 
with imprecision and uncertainty, and learn to adapt to unknown or changing
environments for better performance. For learning and adaptation, soft computing
requires extensive computation. In this sense, soft computing shares the same
characteristics as computational intelligence.

In general, soft computing does not perform much symbolic manipulation, so we 
can view it as a new discipline that complements conventional artificial intelligence 
(AI) approaches, and vice versa. For instance, Figure 1.1 illustrates a situation 
in which a neural character recognizer and a knowledge base are used together 
to determine the meaning of a hand-written word. The neural character recognizer
generates two possible answers “dog” and “dag,” since the middle character could be 
either an “o” or “a.” If the knowledge base provides an extra piece of information 
that the given word is related to animals, then the answer “dog” is picked up 
correctly. 
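
The cooperation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-character scores, the `knowledge_base` dictionary, and the function names `candidate_words` and `resolve` are all hypothetical and are not taken from Figure 1.1 or from the book's programs. The recognizer proposes character candidates, and the knowledge base is consulted to discard words that do not fit the topic.

```python
from itertools import product

# Hypothetical recognizer output: for each character position, the candidate
# characters and a confidence score (values are made up for illustration).
candidates = [
    {"d": 0.9},
    {"o": 0.5, "a": 0.5},  # the ambiguous middle character
    {"g": 0.9},
]

# Hypothetical knowledge base: words known to belong to a topic.
knowledge_base = {"animals": {"dog", "cat", "cow"}}


def candidate_words(candidates):
    """Enumerate every word the recognizer considers possible, with a joint score."""
    words = {}
    for combo in product(*(position.items() for position in candidates)):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, s in combo:
            score *= s
        words[word] = score
    return words


def resolve(candidates, topic, kb):
    """Keep only the words the knowledge base accepts for the topic; return the best one."""
    words = candidate_words(candidates)
    allowed = [w for w in words if w in kb.get(topic, set())]
    return max(allowed, key=words.get) if allowed else None


print(resolve(candidates, "animals", knowledge_base))  # prints: dog
```

The recognizer alone cannot separate "dog" from "dag" because both receive the same joint score; the symbolic side breaks the tie.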

Figure 1.2 is a list of conventional AI approaches and each of the soft comput- 
ing constituents in chronological order. We discuss the features of conventional 
AI in Section 1.2.1, and those of soft computing constituents in Sections 1.2.2 
through 1.2.4. In Section 1.3, we summarize the neuro-fuzzy and soft computing 
characteristics. 

1.2.1 From Conventional AI to Computational Intelligence 

Humans usually employ natural languages in reasoning and drawing conclusions. 
Conventional AI research focuses on an attempt to mimic human intelligent behav- 
ior by expressing it in language forms or symbolic rules. Conventional AI basically 
manipulates symbols on the assumption that such behavior can be stored in sym- 
bolically structured knowledge bases. This is the so-called physical symbol system 
hypothesis [3, 5]. Symbolic systems provide a good basis for modeling human ex- 
perts in some narrow problem areas if explicit knowledge is available. Perhaps the 
most successful conventional AI product is the knowledge-based system or expert 
system (ES); it is represented in a schematic form in Figure 1.3. 

Conventional AI literature reflects earlier work on intelligent systems. Many 
AI precursors defined AI in light of their own philosophy; some representative AI 
definitions are listed next along with a couple of ES definitions. 

• “AI is the study of agents that exist in an environment and perceive and act.” 
(S. Russell and P. Norvig) [6] 

• “AI is the art of making computers do smart things.” (Waldrop) [9] 

• “AI is a programming style, where programs operate on data according to 
rules in order to accomplish goals.” (W. A. Taylor) [8] 

• “AI is the activity of providing such machines as computers with the ability 
to display behavior that would be regarded as intelligent if it were observed 
in humans.” (R. McLeod) [4] 

• “ES is a computer program using expert knowledge to attain high levels of 
performance in a narrow problem area.” (D. A. Waterman) [10] 
Figure 1.2. A historical sketch of soft computing constituents and conventional AI
approaches. 


• “ES is a caricature of the human expert, in the sense that it knows almost 
everything about almost nothing.” (A. R. Mirzai) [2] 

These definitions provide a conspicuous AI framework although they may be some- 
what ephemeral because the conceptual framework is metamorphosing rapidly. The 
reader may well wonder, “Has AI become obsolete already?” 

Calling soft computing constituents “parts of modern AI” inevitably depends 
on personal judgment. It is true that today many books on modern AI describe 
neural networks and perhaps other soft computing components, as seen in [6, 11]. 
Figure 1.3. An expert system: one of the most successful (conventional) AI products.


This means that the AI field is steadily expanding; the boundary between AI and 
soft computing is becoming indistinct and, obviously, successive generations of AI 
methodologies will be growing more sophisticated. Further discussion of these philo- 
sophical AI territories [7] is beyond the scope of this book. 

In practice, symbolic manipulation limits the situations to which conventional
AI theories can be applied, because knowledge acquisition and representation
are arduous tasks. More attention has been directed
toward biologically inspired methodologies such as brain modeling, evolutionary al- 
gorithms, and immune modeling; they simulate biological mechanisms responsible 
for generating natural intelligence. These methodologies are somewhat orthogonal 
to conventional AI approaches and generally compensate for the shortcomings of 
symbolicism. 

The long-term goal of AI research is the creation and understanding of machine 
intelligence. From this perspective, soft computing shares the same ultimate goal 
with AI. Figure 1.4 is a schematic representation of an intelligent system that can 
sense its environment (perceive) and act on its perception (react). An easy extension 
of ES may also result in the same ideal computationally intelligent system sought 
by soft computing researchers. Soft computing is apparently evolving under AI 
influences that sprang from cybernetics (the study of information and control in 
humans and machines). 
Figure 1.4. An intelligent system.


1.2.2 Neural Networks 

The human brain is a source of natural intelligence and a truly remarkable parallel 
computer. The brain processes incomplete information obtained by perception at 
an incredibly rapid rate. Nerve cells function about 10^6 times slower than electronic
circuit gates, but human brains process visual and auditory information much faster 
than modern computers. 

Inspired by biological nervous systems, many researchers, especially brain model- 
ers, have been exploring artificial neural networks, a novel nonalgorithmic approach 
to information processing. They model the brain as a continuous-time nonlinear 
dynamic system in connectionist architectures that are expected to mimic brain 
mechanisms to simulate intelligent behavior. Such connectionism replaces sym- 
bolically structured representations with distributed representations in the form of 
weights between a massive set of interconnected neurons (or processing units). It 
does not need critical decision flows in its algorithms. 

A variety of connectionist approaches have been studied; some representative 
methodologies and their computational capacities are discussed in subsequent chap- 
ters. 


1.2.3 Fuzzy Set Theory 

The human brain interprets imprecise and incomplete sensory information provided 
by perceptive organs. Fuzzy set theory provides a systematic calculus to deal with 
such information linguistically, and it performs numerical computation by using 
linguistic labels stipulated by membership functions. Moreover, a selection of fuzzy 
if-then rules forms the key component of a fuzzy inference system (FIS) that can 
effectively model human expertise in a specific application. 
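
To make the idea of linguistic labels and fuzzy if-then rules concrete, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the triangular shape of the membership functions, the temperature labels and their breakpoints, the fan-speed consequents, and the weighted-average combination (a zero-order Sugeno-style scheme chosen for brevity). It is not meant to reproduce any particular model from the later chapters.

```python
def triangle(x, a, b, c):
    """Triangular membership function: 0 outside (a, c), rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical linguistic labels for a temperature in degrees Celsius.
labels = {
    "cold": lambda t: triangle(t, -10.0, 0.0, 15.0),
    "warm": lambda t: triangle(t, 5.0, 20.0, 30.0),
    "hot": lambda t: triangle(t, 25.0, 35.0, 50.0),
}

# Hypothetical rules of the form "if temperature is <label> then fan speed is <value>".
rules = {"cold": 0.0, "warm": 40.0, "hot": 100.0}


def infer(t):
    """Combine the rule outputs, weighting each rule by how well its label matches t."""
    weights = {name: mf(t) for name, mf in labels.items()}
    total = sum(weights.values())
    if total == 0.0:
        return None  # no rule applies to this input
    return sum(weights[name] * rules[name] for name in rules) / total


print(infer(20.0))  # fully "warm"
print(infer(27.5))  # partly "warm", partly "hot"
```

An input of 27.5 belongs partially to both "warm" and "hot", so the output falls between the two rule consequents instead of jumping from one to the other.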

Although the fuzzy inference system has a structured knowledge representation 
in the form of fuzzy if-then rules, it lacks the adaptability to deal with changing
external environments. Thus, we incorporate neural network learning concepts in 
fuzzy inference systems, resulting in neuro-fuzzy modeling, a pivotal technique in
soft computing. 

We discuss fuzzy sets, fuzzy rules, and fuzzy inference systems in Chapters 2, 3, 
and 4. Approaches to neuro-fuzzy modeling are described in Chapters 12 and 13. 

1.2.4 Evolutionary Computation 

Natural intelligence is the product of millions of years of biological evolution. Simu- 
lating complex biological evolutionary processes may lead us to discover how evolu- 
tion propels living systems toward higher-level intelligence. Greater attention is thus 
being paid to evolutionary computing techniques such as genetic algorithms (GAs),
which are based on the evolutionary principle of natural selection. Immune model- 
ing and Artificial Life are similar disciplines and are based on the assumption that 
chemical and physical laws may be able to explain living intelligence. In particu- 
lar, Artificial Life, an inclusive paradigm, attempts to realize lifelike behavior by 
imitating the processes that occur in the development or mechanics of life [1]. 

Heuristically informed search techniques are employed in many AI applications. 
When a search space is too large for an exhaustive (blind, brute-force) search and 
it is difficult to identify knowledge that can be applied to reduce the search space, 
we have no choice but to use other, more efficient search techniques to find less- 
than-optimum solutions. The GA is a candidate technique for this purpose; it offers 
the capacity for population-based systematic random searches. Simulated annealing 
and random search are other candidates that explore the search space in a stochastic 
manner. Those optimization methods are discussed in Chapter 7. 
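
A population-based random search of the kind mentioned above can be sketched as follows. This is a deliberately small illustration and not the algorithm as presented in Chapter 7: the bit-counting objective, tournament selection, one-point crossover, elitism, and all parameter values are assumptions chosen to keep the example short. Note that the objective is only ever evaluated, never differentiated.

```python
import random


def fitness(bits):
    """Toy objective (an assumption): the number of 1 bits, maximized by all ones."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=60, mutation_rate=0.02, seed=0):
    """Run a minimal genetic algorithm and return the best bit string found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def pick():
        # Tournament selection: the fitter of two random individuals is chosen.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best individual over unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randint(1, n_bits - 1)  # one-point crossover position
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


best = evolve()
print(fitness(best), best)
```

Because the best individual is always copied into the next generation, the best fitness never decreases from one generation to the next, although reaching the global optimum is not guaranteed.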

1.3 NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS

With neuro-fuzzy modeling as a backbone, the characteristics of soft computing can 
be summarized as follows: 

Human expertise Soft computing utilizes human expertise in the form of fuzzy 
if-then rules, as well as in conventional knowledge representations, to solve 
practical problems. 

Biologically inspired computing models Inspired by biological neural networks, 
artificial neural networks are employed extensively in soft computing to deal 
with perception, pattern recognition, and nonlinear regression and classifica- 
tion problems. 

New optimization techniques Soft computing applies innovative optimization 
methods arising from various sources; they are genetic algorithms (inspired by 
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the evolution and selection process), simulated annealing (motivated by ther- 
modynamics), the random search method, and the downhill Simplex method. 
These optimization methods do not require the gradient vector of an objec- 
tive function, so they are more flexible in dealing with complex optimization 
problems. 

Numerical computation Unlike symbolic AI, soft computing relies mainly on nu- 
merical computation. Incorporation of symbolic techniques in soft computing 
is an active research area within this field. 

New application domains Because of its numerical computation, soft comput- 
ing has found a number of new application domains besides that of AI ap- 
proaches. These application domains are mostly computation intensive and 
include adaptive signal processing, adaptive control, nonlinear system identi- 
fication, nonlinear regression, and pattern recognition. 

Model-free learning Neural networks and adaptive fuzzy inference systems have 
the ability to construct models using only target system sample data. Detailed 
insight into the target system helps set up the initial model structure, but it 
is not mandatory. 

Intensive computation Without assuming too much background knowledge of 
the problem being solved, neuro-fuzzy and soft computing rely heavily on 
high-speed number-crunching computation to find rules or regularity in data 
sets. This is a common feature of all areas of computational intelligence. 

Fault tolerance Both neural networks and fuzzy inference systems exhibit fault 
tolerance. The deletion of a neuron in a neural network, or a rule in a fuzzy 
inference system, does not necessarily destroy the system. Instead, the sys- 
tem continues performing because of its parallel and redundant architecture, 
although performance quality gradually deteriorates. 

Goal driven characteristics Neuro-fuzzy and soft computing are goal driven; 
the path leading from the current state to the solution does not really matter 
as long as we are moving toward the goal in the long run. This is particularly 
true when used with derivative-free optimization schemes, such as genetic 
algorithms, simulated annealing, and the random search method. Domain-specific
knowledge helps reduce the amount of computation and search time,
but it is not a requirement.

Real-world applications Most real-world problems are large scale and inevitably 
incorporate built-in uncertainties; this precludes using conventional approaches 
that require detailed description of the problem being solved. Soft computing 
is an integrated approach that can usually utilize specific techniques within 
subtasks to construct generally satisfactory solutions to real-world problems. 
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The field of soft computing is evolving rapidly; new techniques and applications 
are constantly being proposed. We can see that a firm foundation for soft computing 
is being built through the collective efforts of researchers in various disciplines all 
over the world. The underlying driving force is to construct highly automated, 
intelligent machines for a better life tomorrow, which is already just around the 
corner. 
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Chapter 2 


Fuzzy Sets 


J.-S. R. Jang 

This chapter introduces the basic definitions, notation, and operations for fuzzy 
sets that will be needed in the following chapters. Since research on fuzzy sets and 
their applications has been underway for almost 30 years now, it is impossible to 
cover all aspects of current developments in this field. Therefore, the aim of this 
chapter is to provide a concise introduction to and a summary of the basic concepts 
central to the study of fuzzy sets. Detailed treatments of specific subjects can be 
found in the reference list at the end of this chapter. 

2.1 INTRODUCTION 

A classical set is a set with a crisp boundary. For example, a classical set A of real 
numbers greater than 6 can be expressed as 

A = {x | x > 6}, (2.1) 

where there is a clear, unambiguous boundary 6 such that if x is greater than this 
number, then x belongs to the set A; otherwise x does not belong to the set. Al- 
though classical sets are suitable for various applications and have proven to be an 
important tool for mathematics and computer science, they do not reflect the na- 
ture of human concepts and thoughts, which tend to be abstract and imprecise. As 
an illustration, mathematically we can express the set of tall persons as a collection 
of persons whose height is more than 6 ft; this is the set denoted by Equation (2.1), 
if we let A = “tall person” and x = “height.” Yet this is an unnatural and inad- 
equate way of representing our usual concept of “tall person.” For one thing, the 
dichotomous nature of the classical set would classify a person 6.001 ft tall as a tall 
person, but not a person 5.999 ft tall. This distinction is intuitively unreasonable. 
The flaw comes from the sharp transition between inclusion and exclusion in a set. 

In contrast to a classical set, a fuzzy set, as the name implies, is a set without 
a crisp boundary. That is, the transition from “belong to a set” to “not belong to a 
set” is gradual, and this smooth transition is characterized by membership functions 
that give fuzzy sets flexibility in modeling commonly used linguistic expressions, 
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such as “the water is hot” or “the temperature is high.” As Zadeh pointed out in 
1965 in his seminal paper entitled “Fuzzy Sets” [11], such imprecisely defined sets 
or classes “play an important role in human thinking, particularly in the domains 
of pattern recognition, communication of information, and abstraction.” Note that 
the fuzziness does not come from the randomness of the constituent members of 
the sets, but from the uncertain and imprecise nature of abstract thoughts and 
concepts. 

Let us now set forth several basic definitions concerning fuzzy sets. 

2.2 BASIC DEFINITIONS AND TERMINOLOGY 

Let X be a space of objects and x be a generic element of X. A classical set A,
$A \subseteq X$, is defined as a collection of elements or objects $x \in X$, such that each x can
either belong or not belong to the set A. By defining a characteristic function
for each element x in X, we can represent a classical set A by a set of ordered pairs
(x, 0) or (x, 1), which indicates $x \notin A$ or $x \in A$, respectively.

Unlike the aforementioned conventional set, a fuzzy set [11] expresses the degree 
to which an element belongs to a set. Hence the characteristic function of a fuzzy set 
is allowed to have values between 0 and 1, which denotes the degree of membership 
of an element in a given set. 

Definition 2.1 Fuzzy sets and membership functions 

If X is a collection of objects denoted generically by x, then a fuzzy set A in X is
defined as a set of ordered pairs:

$$A = \{(x, \mu_A(x)) \mid x \in X\}, \tag{2.2}$$

where $\mu_A(x)$ is called the membership function (or MF for short) for the fuzzy
set A. The MF maps each element of X to a membership grade (or membership
value) between 0 and 1.


□ 

Obviously, the definition of a fuzzy set is a simple extension of the definition of
a classical set in which the characteristic function is permitted to have any value
between 0 and 1. If the value of the membership function $\mu_A(x)$ is restricted to
either 0 or 1, then A is reduced to a classical set and $\mu_A(x)$ is the characteristic
function of A. For clarity, we shall also refer to classical sets as ordinary sets, crisp
sets, nonfuzzy sets, or just sets.

Usually X is referred to as the universe of discourse, or simply the universe, 
and it may consist of discrete (ordered or nonordered) objects or continuous space. 
This can be clarified by the following examples. 

Figure 2.1. Membership functions on (a) a discrete universe, A = “sensible number of
children in a family”; (b) a continuous universe with X = age, B = “about 50
years old.” (MATLAB file: mf_univ.m)

Example 2.1 Fuzzy sets with a discrete nonordered universe


Let X = {San Francisco, Boston, Los Angeles} be the set of cities one may choose 
to live in. The fuzzy set C = “desirable city to live in” may be described as follows: 

C = {(San Francisco, 0.9), (Boston, 0.8), (Los Angeles, 0.6)}. 

Here the universe of discourse X is discrete and contains nonordered
objects: in this case, three big cities in the United States. As one can see, the
membership grades listed above are quite subjective; anyone can come up
with three different but legitimate values to reflect his or her preference.

□ 


Example 2.2 Fuzzy sets with a discrete ordered universe 

Let X = {0, 1, 2, 3, 4, 5, 6} be the set of numbers of children a family may choose 
to have. Then the fuzzy set A = “sensible number of children in a family” may be 
described as follows: 

A = {(0,0.1), (1,0.3), (2,0.7), (3, 1), (4, 0.7), (5, 0.3), (6,0.1)}. 

Here we have a discrete ordered universe X; the MF for the fuzzy set A is shown 
in Figure 2.1(a). Again, the membership grades of this fuzzy set are obviously 
subjective measures. 

□ 


Example 2.3 Fuzzy sets with a continuous universe 

Let $X = R^+$ be the set of possible ages for human beings. Then the fuzzy set B =
“about 50 years old” may be expressed as

$$B = \{(x, \mu_B(x)) \mid x \in X\},$$

where

$$\mu_B(x) = \frac{1}{1 + \left(\frac{x - 50}{10}\right)^4}.$$

This is illustrated in Figure 2.1(b).


□ 
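The three examples above can be mirrored in a few lines of Python. This is only an illustrative sketch: the dictionary representation, the helper name `grade`, and the function `about_50` (a bell-shaped curve centered at 50, assumed here as the MF for “about 50 years old”) are choices made for this sketch, not part of the text.

```python
# Fuzzy sets on discrete universes, stored as {element: membership grade}.
# The grades are the subjective values quoted in Examples 2.1 and 2.2.
desirable_city = {"San Francisco": 0.9, "Boston": 0.8, "Los Angeles": 0.6}
sensible_children = {0: 0.1, 1: 0.3, 2: 0.7, 3: 1.0, 4: 0.7, 5: 0.3, 6: 0.1}


def about_50(age):
    """Assumed MF for "about 50 years old": bell-shaped, centered at 50."""
    return 1.0 / (1.0 + ((age - 50.0) / 10.0) ** 4)


def grade(fuzzy_set, element):
    """Membership grade in a discrete fuzzy set; unlisted elements get 0."""
    return fuzzy_set.get(element, 0.0)


print(grade(desirable_city, "Boston"))
print(grade(sensible_children, 3))
print(about_50(50), about_50(60))
```

A discrete fuzzy set is fully described by its list of pairs, whereas a continuous one needs a formula that can be evaluated at any point of the universe.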

From the preceding examples, it is obvious that the construction of a fuzzy 
set depends on two things: the identification of a suitable universe of discourse 
and the specification of an appropriate membership function. The specification of 
membership functions is subjective, which means that the membership functions
specified for the same concept (say, “sensible number of children in a family”) by
different persons may vary considerably. This subjectivity comes from individual
differences in perceiving or expressing abstract concepts and has little to do with
randomness. Therefore, the subjectivity and nonrandomness of fuzzy sets are the
primary difference between the study of fuzzy sets and probability theory, which
deals with the objective treatment of random phenomena.

For simplicity of notation, we now introduce an alternative way of denoting a 
fuzzy set. A fuzzy set A can be denoted as follows: 

$$A = \begin{cases} \sum_{x_i \in X} \mu_A(x_i)/x_i, & \text{if } X \text{ is a collection of discrete objects,} \\ \int_X \mu_A(x)/x, & \text{if } X \text{ is a continuous space (usually the real line } R\text{).} \end{cases} \tag{2.3}$$

The summation and integration signs in Equation (2.3) stand for the union of
$(x, \mu_A(x))$ pairs; they do not indicate summation or integration. Similarly, “/” is
only a marker and does not imply division.

Example 2.4 Alternative expression 

Using the notation of Equation (2.3), we can rewrite the fuzzy sets in Examples 2.1, 
2.2, and 2.3 as 

C = 0.9/San Francisco + 0.8/Boston + 0.6/Los Angeles, 

A = 0.1/0 + 0.3/1 + 0.7/2 + 1.0/3 + 0.7/4 + 0.3/5 + 0.1/6, 

and 

$$B = \int_{R^+} \frac{1}{1 + \left(\frac{x - 50}{10}\right)^4} \Big/ x,$$

respectively. 


□ 

In practice, when the universe of discourse X is a continuous space (the real line
R or its subset), we usually partition X into several fuzzy sets whose MFs cover
X in a more or less uniform manner. These fuzzy sets, which usually carry names
that conform to adjectives appearing in our daily linguistic usage, such as “large,”
“medium,” or “small,” are called linguistic values or linguistic labels. Thus, the
universe of discourse X is often called the linguistic variable. Formal definitions of
linguistic variables and linguistic values are given in the next chapter; here we shall
give only a simple example.

Figure 2.2. Typical MFs of linguistic values “young,” “middle aged,” and “old.”
(MATLAB file: lingmf.m)

Example 2.5 Linguistic variables and linguistic values 

Suppose that X = “age.” Then we can define fuzzy sets “young,” “middle aged,”
and “old” that are characterized by MFs $\mu_{\text{young}}(x)$, $\mu_{\text{middle aged}}(x)$, and $\mu_{\text{old}}(x)$,
respectively. Just as a variable can assume various values, a linguistic variable “age”
can assume different linguistic values, such as “young,” “middle aged,” and “old” in
this case. If “age” assumes the value of “young,” then we have the expression “age
is young,” and so forth for the other values. Typical MFs for these linguistic values
are displayed in Figure 2.2, where the universe of discourse X is totally covered by
the MFs and the transition from one MF to another is smooth and gradual.


□ 

A fuzzy set is uniquely specified by its membership function. To describe mem- 
bership functions more specifically, we shall define the nomenclature used in the 
literature. (Unless otherwise specified, we shall assume that the universe of the 
fuzzy sets under discussion is the real line R or its subset.) 

Definition 2.2 Support 

The support of a fuzzy set A is the set of all points x in X such that $\mu_A(x) > 0$:

$$\text{support}(A) = \{x \mid \mu_A(x) > 0\}. \tag{2.4}$$

□ 


Definition 2.3 Core

The core of a fuzzy set A is the set of all points x in X such that $\mu_A(x) = 1$:

$$\text{core}(A) = \{x \mid \mu_A(x) = 1\}. \tag{2.5}$$


□ 


Definition 2.4 Normality 

A fuzzy set A is normal if its core is nonempty. In other words, we can always find
a point $x \in X$ such that $\mu_A(x) = 1$.

□ 


Definition 2.5 Crossover points 

A crossover point of a fuzzy set A is a point $x \in X$ at which $\mu_A(x) = 0.5$:

$$\text{crossover}(A) = \{x \mid \mu_A(x) = 0.5\}. \tag{2.6}$$

□ 


Definition 2.6 Fuzzy singleton 

A fuzzy set whose support is a single point in X with $\mu_A(x) = 1$ is called a fuzzy
singleton.


□ 

Figures 2.3(a) and 2.3(b) illustrate the cores, supports, and crossover points of 
the bell-shaped membership function representing “middle aged” and of the fuzzy 
singleton characterizing “45 years old.” 

Definition 2.7 α-cut, strong α-cut

The α-cut or α-level set of a fuzzy set A is a crisp set defined by

$$A_\alpha = \{x \mid \mu_A(x) \ge \alpha\}. \tag{2.7}$$

The strong α-cut or strong α-level set is defined similarly:

$$A'_\alpha = \{x \mid \mu_A(x) > \alpha\}. \tag{2.8}$$

□ 

Using the notation for a level set, we can express the support and core of a fuzzy
set A as

$$\text{support}(A) = A'_0$$

and

$$\text{core}(A) = A_1,$$

respectively.
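For a discrete fuzzy set, the level sets, support, core, and normality test translate directly into set comprehensions. The sketch below is illustrative only; it reuses the grades of Example 2.2, and the function names are chosen for this sketch.

```python
# A discrete fuzzy set stored as {element: membership grade} (Example 2.2).
A = {0: 0.1, 1: 0.3, 2: 0.7, 3: 1.0, 4: 0.7, 5: 0.3, 6: 0.1}


def alpha_cut(fuzzy_set, alpha, strong=False):
    """Crisp set of elements with grade >= alpha (or > alpha if strong)."""
    if strong:
        return {x for x, mu in fuzzy_set.items() if mu > alpha}
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}


def support(fuzzy_set):
    """The support is the strong 0-cut: every element with a positive grade."""
    return alpha_cut(fuzzy_set, 0.0, strong=True)


def core(fuzzy_set):
    """The core is the 1-cut: every element with full membership."""
    return alpha_cut(fuzzy_set, 1.0)


def is_normal(fuzzy_set):
    """A fuzzy set is normal when its core is nonempty."""
    return len(core(fuzzy_set)) > 0


print(sorted(support(A)), sorted(core(A)), sorted(alpha_cut(A, 0.7)))
```

Note that the ordinary 0.7-cut keeps the elements whose grade equals 0.7, while the strong 0.7-cut drops them.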



Figure 2.3. Cores, supports, and crossover points of (a) the fuzzy set “middle
aged” and (b) the fuzzy singleton “45 years old.”

Definition 2.8 Convexity 

A fuzzy set A is convex if and only if for any $x_1, x_2 \in X$ and any $\lambda \in [0, 1]$,

$$\mu_A(\lambda x_1 + (1 - \lambda) x_2) \ge \min\{\mu_A(x_1), \mu_A(x_2)\}. \tag{2.9}$$

Alternatively, A is convex if all its α-level sets are convex.

□ 

A crisp set C in $R^n$ is convex if and only if for any two points $x_1 \in C$ and
$x_2 \in C$, their convex combination $\lambda x_1 + (1 - \lambda) x_2$ is still in C, where $0 \le \lambda \le 1$.
Hence the convexity of a (crisp) level set $A_\alpha$ implies that $A_\alpha$ is composed of a single
line segment only.

Note that the definition of convexity of a fuzzy set is not as strict as the common
definition of convexity of a function. For comparison, the definition of convexity of
a function f(x) used here is

$$f(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda f(x_1) + (1 - \lambda) f(x_2), \tag{2.10}$$

which is a tighter condition than Equation (2.9), since the right-hand side of
Equation (2.10) is never smaller than $\min\{f(x_1), f(x_2)\}$. (In standard analysis
terminology, a function satisfying Equation (2.10) is called concave.)

Figure 2.4. (a) Two convex membership functions; (b) a nonconvex membership
function. (MATLAB file: convexmf.m)

Figure 2.4 illustrates the concept of convexity of fuzzy sets; Figure 2.4(a) shows
two convex fuzzy sets [the left fuzzy set satisfies both Equations (2.9) and (2.10),
while the right one satisfies Equation (2.9) only]; Figure 2.4(b) is a nonconvex fuzzy
set.
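On membership grades sampled along an ordered grid, the convexity condition of Equation (2.9) says that no sample lying between two others may fall below the smaller of the two. The following sketch checks exactly that; the function name, the tolerance, and the two sample lists are illustrative assumptions, and a sampled test can of course miss violations between grid points.

```python
def is_fuzzy_convex(grades, tol=1e-12):
    """Test Equation (2.9) on grades sampled along an ordered grid."""
    n = len(grades)
    for i in range(n):
        for k in range(i + 2, n):
            lower = min(grades[i], grades[k])
            # Every sample strictly between positions i and k must reach `lower`.
            if any(grades[j] < lower - tol for j in range(i + 1, k)):
                return False
    return True


one_peak = [0.0, 0.2, 0.7, 1.0, 0.6, 0.1]   # rises, then falls
two_peaks = [0.0, 0.8, 0.3, 1.0, 0.2, 0.0]  # dips between two peaks

print(is_fuzzy_convex(one_peak), is_fuzzy_convex(two_peaks))
```

The single-peaked list passes even though it is not concave as a function, which is the sense in which Equation (2.9) is the weaker requirement.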

Definition 2.9 Fuzzy numbers 

A fuzzy number A is a fuzzy set in the real line (R) that satisfies the conditions for
normality and convexity.


□ 

Most (noncomposite) fuzzy sets used in the literature satisfy the conditions for 
normality and convexity, so fuzzy numbers are the most basic type of fuzzy sets. 

Definition 2.10 Bandwidths of normal and convex fuzzy sets 

For a normal and convex fuzzy set, the bandwidth or width is defined as the
distance between the two unique crossover points:

$$\text{width}(A) = |x_2 - x_1|, \tag{2.11}$$

where $\mu_A(x_1) = \mu_A(x_2) = 0.5$.


□ 


Definition 2.11 Symmetry 

A fuzzy set A is symmetric if its MF is symmetric around a certain point x = c,
namely,

$$\mu_A(c + x) = \mu_A(c - x) \quad \text{for all } x \in X.$$
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□ 


Definition 2.12 Open left, open right, closed

A fuzzy set A is open left if $\lim_{x \to -\infty} \mu_A(x) = 1$ and $\lim_{x \to +\infty} \mu_A(x) = 0$; open
right if $\lim_{x \to -\infty} \mu_A(x) = 0$ and $\lim_{x \to +\infty} \mu_A(x) = 1$; and closed if
$\lim_{x \to -\infty} \mu_A(x) = \lim_{x \to +\infty} \mu_A(x) = 0$.

□ 

For instance, the fuzzy set “young” in Figure 2.2 is open left; “old” is open right; 
and “middle aged” is closed. 

2.3 SET-THEORETIC OPERATIONS 

Union, intersection, and complement are the most basic operations on classical sets. 
On the basis of these three operations, a number of identities can be established, 
as listed in Table 2.1. These identities can be verified using Venn diagrams. 

Corresponding to the ordinary set operations of union, intersection, and com- 
plement, fuzzy sets have similar operations, which were initially defined in Zadeh’s 
seminal paper [11]. Before introducing these three fuzzy set operations, first we 
shall define the notion of containment, which plays a central role in both ordinary 
and fuzzy sets. This definition of containment is, of course, a natural extension of 
the case for ordinary sets. 

Definition 2.13 Containment or subset 

Fuzzy set A is contained in fuzzy set B (or, equivalently, A is a subset of B, or
A is smaller than or equal to B) if and only if $\mu_A(x) \le \mu_B(x)$ for all x. In symbols,

$$A \subseteq B \iff \mu_A(x) \le \mu_B(x). \tag{2.12}$$

□ 


Figure 2.5 illustrates the concept of $A \subseteq B$.

Definition 2.14 Union (disjunction) 

The union of two fuzzy sets A and B is a fuzzy set C, written as $C = A \cup B$ or
C = A OR B, whose MF is related to those of A and B by

$$\mu_C(x) = \max(\mu_A(x), \mu_B(x)) = \mu_A(x) \vee \mu_B(x). \tag{2.13}$$

□ 

As pointed out by Zadeh [11], a more intuitive but equivalent definition of union
is the “smallest” fuzzy set containing both A and B. Alternatively, if D is any fuzzy
set that contains both A and B, then it also contains $A \cup B$. The intersection of
fuzzy sets can be defined analogously.

Table 2.1. Basic identities of classical sets, where A, B, and C are crisp sets;
$\bar{A}$, $\bar{B}$, and $\bar{C}$ are their corresponding complements; X is the universe; and
$\emptyset$ is the empty set.

Law of contradiction          $A \cap \bar{A} = \emptyset$
Law of the excluded middle    $A \cup \bar{A} = X$
Idempotency                   $A \cap A = A$, $A \cup A = A$
Involution                    $\bar{\bar{A}} = A$
Commutativity                 $A \cap B = B \cap A$, $A \cup B = B \cup A$
Associativity                 $(A \cup B) \cup C = A \cup (B \cup C)$
                              $(A \cap B) \cap C = A \cap (B \cap C)$
Distributivity                $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
                              $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
Absorption                    $A \cup (A \cap B) = A$
                              $A \cap (A \cup B) = A$
Absorption of complement      $A \cup (\bar{A} \cap B) = A \cup B$
                              $A \cap (\bar{A} \cup B) = A \cap B$
DeMorgan's laws               $\overline{A \cup B} = \bar{A} \cap \bar{B}$
                              $\overline{A \cap B} = \bar{A} \cup \bar{B}$

Definition 2.15 Intersection (conjunction) 

The intersection of two fuzzy sets A and B is a fuzzy set C, written as $C = A \cap B$
or C = A AND B, whose MF is related to those of A and B by

$$\mu_C(x) = \min(\mu_A(x), \mu_B(x)) = \mu_A(x) \wedge \mu_B(x). \tag{2.14}$$

□ 

As in the case of the union, it is obvious that the intersection of A and B is 
the “largest” fuzzy set which is contained in both A and B. This reduces to the 
ordinary intersection operation if both A and B are nonfuzzy. 

Definition 2.16 Complement (negation) 

The complement of fuzzy set A, denoted by $\bar{A}$ ($\neg A$, NOT A), is defined as

$$\mu_{\bar{A}}(x) = 1 - \mu_A(x). \tag{2.15}$$

□

Figure 2.5. The concept of $A \subseteq B$. (MATLAB file: subset.m)

Figure 2.6. Operations on fuzzy sets: (a) two fuzzy sets A and B; (b) $\bar{A}$ (“NOT A”);
(c) $A \cup B$ (“A OR B”); (d) $A \cap B$ (“A AND B”). (MATLAB file: fuzsetop.m)

Figure 2.6 demonstrates these three basic operations: Figure 2.6(a) illustrates 
two fuzzy sets A and B; Figure 2.6(b) is the complement of A; Figure 2.6(c) is the 
union of A and B; and Figure 2.6(d) is the intersection of A and B. 
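On a shared discrete universe, containment and the three standard operations reduce to pointwise comparisons, max, min, and subtraction from 1. The sketch below illustrates this with two made-up fuzzy sets; the grades and function names are assumptions of this sketch.

```python
# Two fuzzy sets on the same discrete universe (illustrative grades).
A = {0: 0.0, 1: 0.4, 2: 1.0, 3: 0.4, 4: 0.0}
B = {0: 0.2, 1: 0.6, 2: 1.0, 3: 0.8, 4: 0.5}


def fuzzy_union(a, b):
    """Equation (2.13): pointwise maximum of the two MFs."""
    return {x: max(a[x], b[x]) for x in a}


def fuzzy_intersection(a, b):
    """Equation (2.14): pointwise minimum of the two MFs."""
    return {x: min(a[x], b[x]) for x in a}


def fuzzy_complement(a):
    """Equation (2.15): one minus the membership grade."""
    return {x: 1.0 - mu for x, mu in a.items()}


def is_contained(a, b):
    """Equation (2.12): a is a subset of b if its grades never exceed b's."""
    return all(a[x] <= b[x] for x in a)


print(is_contained(A, B))
print(fuzzy_union(A, B) == B, fuzzy_intersection(A, B) == A)
```

Because A is contained in B here, the union collapses to B and the intersection to A, consistent with the “smallest containing set” and “largest contained set” readings.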

Equations (2.13), (2.14), and (2.15) perform exactly as the corresponding op- 
erations for ordinary sets if the values of the membership functions are restricted 
to either 0 or 1. However, it is understood that these functions are not the only 
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possible generalizations of the crisp set operations. For each of the aforementioned 
three set operations, several different classes of functions with desirable properties 
have been proposed subsequently in the literature; we will introduce some of these 
functions in Section 2.5. The appropriateness of these functions can be checked via 
the identities in Table 2.1 (see Exercises 3 and 4). For distinction, the min
[Equation (2.14)], max [Equation (2.13)], and the complement operator [Equation (2.15)]
will be referred to as the classical or standard fuzzy operators for intersection,
union, and negation, respectively, on fuzzy sets.
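One way to carry out such a check numerically is to evaluate an identity over a grid of membership grades. The sketch below, with names chosen for illustration, tests De Morgan's laws, the law of contradiction, and the law of the excluded middle for the standard min, max, and complement operators.

```python
def f_and(a, b):
    """Standard fuzzy intersection on grades."""
    return min(a, b)


def f_or(a, b):
    """Standard fuzzy union on grades."""
    return max(a, b)


def f_not(a):
    """Standard fuzzy complement on a grade."""
    return 1.0 - a


grades = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

de_morgan_holds = all(
    abs(f_not(f_or(a, b)) - f_and(f_not(a), f_not(b))) < 1e-12
    and abs(f_not(f_and(a, b)) - f_or(f_not(a), f_not(b))) < 1e-12
    for a in grades
    for b in grades
)
# Law of contradiction: the intersection of A and NOT A should be empty.
contradiction_holds = all(f_and(a, f_not(a)) < 1e-12 for a in grades)
# Law of the excluded middle: the union of A and NOT A should be the universe.
excluded_middle_holds = all(abs(f_or(a, f_not(a)) - 1.0) < 1e-12 for a in grades)

print(de_morgan_holds, contradiction_holds, excluded_middle_holds)
```

With these operators De Morgan's laws survive, whereas the laws of contradiction and the excluded middle fail for any grade strictly between 0 and 1 (a grade of 0.5 gives 0.5 in both cases); restricting the grades to 0 and 1 restores them.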

Next we define other operations on fuzzy sets which are also direct generalizations
of operations on ordinary sets.

Definition 2.17 Cartesian product and co-product 

Let A and B be fuzzy sets in X and Y, respectively. The Cartesian product of
A and B, denoted by A × B, is a fuzzy set in the product space X × Y with the
membership function

$$\mu_{A \times B}(x, y) = \min(\mu_A(x), \mu_B(y)). \tag{2.16}$$

Similarly, the Cartesian co-product A + B is a fuzzy set with the membership
function

$$\mu_{A + B}(x, y) = \max(\mu_A(x), \mu_B(y)). \tag{2.17}$$

Both A × B and A + B are characterized by two-dimensional MFs, which are explored
in greater detail in Section 2.4.2.
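For discrete fuzzy sets the Cartesian product and co-product can be tabulated pair by pair. In the sketch below, the two fuzzy sets, their element labels, and the function names are hypothetical and serve only to illustrate the min and max rules.

```python
# Hypothetical fuzzy sets on two different universes X and Y.
A = {"low": 0.2, "high": 1.0}  # fuzzy set in X
B = {1: 0.5, 2: 0.9}           # fuzzy set in Y


def cartesian_product(a, b):
    """MF of A x B on X x Y: the minimum of the two grades."""
    return {(x, y): min(ma, mb) for x, ma in a.items() for y, mb in b.items()}


def cartesian_coproduct(a, b):
    """MF of A + B on X x Y: the maximum of the two grades."""
    return {(x, y): max(ma, mb) for x, ma in a.items() for y, mb in b.items()}


product = cartesian_product(A, B)
coproduct = cartesian_coproduct(A, B)
print(product[("low", 1)], coproduct[("low", 1)])
```

Each result is a two-dimensional MF: its domain is the set of (x, y) pairs rather than either universe alone.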

2.4 MF FORMULATION AND PARAMETERIZATION 

As mentioned earlier, a fuzzy set is completely characterized by its MF. Since most 
fuzzy sets in use have a universe of discourse X consisting of the real line R, it 
would be impractical to list all the pairs defining a membership function. A more 
convenient and concise way to define an MF is to express it as a mathematical 
formula, as in Example 2.3. In this section we describe the classes of parameterized 
functions commonly used to define MFs of one and two dimensions. MFs of higher 
dimensions can be defined similarly. Moreover, we give the derivatives of some
of the MFs with respect to their inputs and parameters. These derivatives are
important for fine-tuning a fuzzy inference system to achieve a desired input/output 
mapping; techniques for fine-tuning fuzzy inference systems are discussed in detail 
in Chapter 4. 

2.4.1 MFs of One Dimension 

First we define several classes of parameterized MFs of one dimension — that is, MFs 
with a single input. 

Definition 2.18 Triangular MFs

A triangular MF is specified by three parameters {a, b, c} as follows:

$$\text{triangle}(x; a, b, c) = \begin{cases} 0, & x \le a, \\ \frac{x - a}{b - a}, & a \le x \le b, \\ \frac{c - x}{c - b}, & b \le x \le c, \\ 0, & c \le x. \end{cases} \tag{2.18}$$

By using min and max, we have an alternative expression for the preceding equation:

$$\text{triangle}(x; a, b, c) = \max\left(\min\left(\frac{x - a}{b - a}, \frac{c - x}{c - b}\right), 0\right). \tag{2.19}$$

The parameters {a, b, c} (with a < b < c) determine the x coordinates of the three
corners of the underlying triangular MF.


□ 


Figure 2.7(a) illustrates a triangular MF defined by triangle(x; 20, 60, 80).

Definition 2.19 Trapezoidal MFs

A trapezoidal MF is specified by four parameters {a, b, c, d} as follows:

$$\text{trapezoid}(x; a, b, c, d) = \begin{cases} 0, & x \le a, \\ \frac{x - a}{b - a}, & a \le x \le b, \\ 1, & b \le x \le c, \\ \frac{d - x}{d - c}, & c \le x \le d, \\ 0, & d \le x. \end{cases} \tag{2.20}$$

An alternative concise expression using min and max is

$$\text{trapezoid}(x; a, b, c, d) = \max\left(\min\left(\frac{x - a}{b - a}, 1, \frac{d - x}{d - c}\right), 0\right). \tag{2.21}$$

The parameters {a, b, c, d} (with $a < b \le c < d$) determine the x coordinates of the
four corners of the underlying trapezoidal MF.

□

Figure 2.7(b) illustrates a trapezoidal MF defined by trapezoid(x; 10, 20, 60, 95).
Note that a trapezoidal MF with parameters {a, b, c, d} reduces to a triangular MF
when b is equal to c.
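The min/max forms are convenient to code because they avoid explicit case analysis. The following sketch assumes the forms max(min((x − a)/(b − a), (c − x)/(c − b)), 0) for the triangle and max(min((x − a)/(b − a), 1, (d − x)/(d − c)), 0) for the trapezoid; the Python function names simply mirror the notation of the text.

```python
def triangle(x, a, b, c):
    """Triangular MF in min/max form; assumes a < b < c."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


def trapezoid(x, a, b, c, d):
    """Trapezoidal MF in min/max form; assumes a < b <= c < d."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)


# Parameters taken from Figure 2.7(a) and (b).
print(triangle(40, 20, 60, 80), triangle(60, 20, 60, 80))
print(trapezoid(15, 10, 20, 60, 95), trapezoid(40, 10, 20, 60, 95))

# With b equal to c the trapezoid collapses to the triangle.
same = all(
    trapezoid(x, 20, 60, 60, 80) == triangle(x, 20, 60, 80) for x in range(0, 101, 5)
)
print(same)
```

Outside the interval [a, c] (or [a, d]) one of the ramps turns negative, and the outer max clips the result to zero.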

Due to their simple formulas and computational efficiency, both triangular MFs 
and trapezoidal MFs have been used extensively, especially in real-time implemen- 
tations. However, since the MFs are composed of straight line segments, they are 
not smooth at the corner points specified by the parameters. In the following we 
introduce other types of MFs defined by smooth and nonlinear functions.

Figure 2.7. Examples of four classes of parameterized MFs: (a) triangle(x; 20, 60, 80);
(b) trapezoid(x; 10, 20, 60, 95); (c) gaussian(x; 50, 20); (d) bell(x; 20, 4, 50).
(MATLAB file: disp_mf.m)

Definition 2.20 Gaussian MFs

A Gaussian MF is specified by two parameters $\{c, \sigma\}$:

$$\text{gaussian}(x; c, \sigma) = e^{-\frac{1}{2}\left(\frac{x - c}{\sigma}\right)^2}. \tag{2.22}$$

□

A Gaussian MF is determined completely by c and $\sigma$; c represents the MF's
center and $\sigma$ determines the MF's width. Figure 2.7(c) plots a Gaussian MF defined
by gaussian(x; 50, 20).

Definition 2.21 Generalized bell MFs

A generalized bell MF (or bell MF) is specified by three parameters {a, b, c}:

$$\text{bell}(x; a, b, c) = \frac{1}{1 + \left|\frac{x - c}{a}\right|^{2b}}, \tag{2.23}$$

where the parameter b is usually positive. (If b is negative, the shape of this MF
becomes an upside-down bell.) Note that this MF is a direct generalization of
the Cauchy distribution used in probability theory, so it is also referred to as the
Cauchy MF.

□

Figure 2.8. The effects of changing parameters in bell MFs: (a) changing parameter a;
(b) changing parameter b; (c) changing parameter c; (d) changing a and b
simultaneously but keeping their ratio constant. (MATLAB file: allbells.m)

Figure 2.7(d) illustrates a generalized bell MF defined by bell(x; 20, 4, 50). A
desired generalized bell MF can be obtained by a proper selection of the parameter
set {a, b, c}. Specifically, we can adjust c and a to vary the center and width of
the MF, and then use b to control the slopes at the crossover points. Figure 2.9
shows the physical meaning of each parameter in a bell MF. Figure 2.8 further
illustrates the effects of changing each parameter. To obtain hands-on experience of
these effects, the reader is encouraged to run the MATLAB file bellmanu.m, which is
available via FTP and WWW (see page xxiii). Another file, bellanim.m, gives more
vivid visual effects by animating two bell MFs while their parameters are changing.
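The roles of the parameters can also be checked numerically. The sketch below assumes the usual forms gaussian(x; c, σ) = exp(−((x − c)/σ)²/2) and bell(x; a, b, c) = 1/(1 + |(x − c)/a|^(2b)); the helper `slope` is a simple central-difference estimate introduced only for this illustration. Under these assumptions the bell MF has its crossover points at c ± a, and differentiating the formula gives a slope of −b/(2a) at x = c + a.

```python
import math


def gaussian(x, c, sigma):
    """Gaussian MF with center c and width parameter sigma (assumed form)."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)


def bell(x, a, b, c):
    """Generalized bell MF with width a, steepness b, and center c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def slope(f, x, h=1e-6):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


a, b, c = 20, 4, 50  # parameters of Figure 2.7(d)
right_crossover = c + a
slope_at_crossover = slope(lambda x: bell(x, a, b, c), right_crossover)

print(bell(c, a, b, c), bell(right_crossover, a, b, c))
print(slope_at_crossover, -b / (2 * a))
print(gaussian(50, 50, 20), gaussian(70, 50, 20))
```

Doubling b while keeping a fixed doubles the steepness at the crossover points without moving them, which is the extra degree of freedom the bell MF has over the Gaussian MF.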

Because of their smoothness and concise notation, Gaussian and bell MFs are
becoming increasingly popular for specifying fuzzy sets. Gaussian functions are
well known in probability and statistics, and they possess useful properties such as
invariance under multiplication (the product of two Gaussians is a Gaussian with a
scaling factor) and Fourier transform (the Fourier transform of a Gaussian is still a
Gaussian). The bell MF has one more parameter than the Gaussian MF, so it has
one more degree of freedom to adjust the steepness at the crossover points.

Figure 2.9. Physical meaning of parameters in a generalized bell MF.

Although the Gaussian MFs and bell MFs achieve smoothness, they are unable
to specify asymmetric MFs, which are important in certain applications. Next we
define the sigmoidal MF, which is either open left or open right. Asymmetric and
closed MFs can be synthesized using either the absolute difference or the product of
two sigmoidal functions, as explained next.

Definition 2.22 Sigmoidal MFs 


A sigmoidal MF is defined by

$$\text{sig}(x; a, c) = \frac{1}{1 + \exp[-a(x - c)]}, \tag{2.24}$$

where a controls the slope at the crossover point x = c.


□ 

Depending on the sign of the parameter a, a sigmoidal MF is inherently open 
right or left and thus is appropriate for representing concepts such as “very large” 
or “very negative.” Sigmoidal functions of this kind are employed widely as the 
activation function of artificial neural networks. Therefore, for a neural network to 
simulate the behavior of a fuzzy inference system (more on this in later chapters), 
the first problem we face is how to synthesize a closed MF from sigmoidal
functions. Two simple ways of achieving this are shown in the following example.

Example 2.6 Closed and asymmetric MFs based on sigmoidal functions

Figure 2.10(a) shows two sigmoidal functions $y_1$ = sig(x; 1, -5) and $y_2$ = sig(x; 2, 5);
a closed and asymmetric MF can be obtained by taking their difference $|y_1 - y_2|$, as
shown in Figure 2.10(b). Figure 2.10(c) shows an additional sigmoidal MF defined
as $y_3$ = sig(x; -2, 5); another way to form a closed and asymmetric MF is to take
the product $y_1 y_3$, as shown in Figure 2.10(d). The reader is encouraged to try
the MATLAB file siganim.m (available via FTP and WWW, see page xxiii), which
displays the animation of two composite MFs based on sigmoidal functions.

□

Figure 2.10. (a) Two sigmoidal functions $y_1$ and $y_2$; (b) a closed MF obtained from
$|y_1 - y_2|$; (c) two sigmoidal functions $y_1$ and $y_3$; (d) a closed MF obtained from $y_1 y_3$.
(MATLAB file: disp_sig.m)
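Both constructions are easy to reproduce. The sketch below assumes the sigmoidal form sig(x; a, c) = 1/(1 + exp(−a(x − c))) and the parameter values quoted in the example; the names `closed_by_difference` and `closed_by_product` are chosen for this illustration.

```python
import math


def sig(x, a, c):
    """Sigmoidal MF: open right for a > 0, open left for a < 0.

    math.exp can overflow for very large |a * (x - c)|; the moderate
    arguments used below stay well within range.
    """
    return 1.0 / (1.0 + math.exp(-a * (x - c)))


def closed_by_difference(x):
    """|y1 - y2| with y1 = sig(x; 1, -5) and y2 = sig(x; 2, 5)."""
    return abs(sig(x, 1, -5) - sig(x, 2, 5))


def closed_by_product(x):
    """y1 * y3 with y1 = sig(x; 1, -5) and y3 = sig(x; -2, 5)."""
    return sig(x, 1, -5) * sig(x, -2, 5)


for x in (-30, 0, 30):
    print(x, closed_by_difference(x), closed_by_product(x))
```

Both composites are close to 1 around x = 0 and vanish toward either end of the axis, and their left and right flanks have different steepness because the two constituent sigmoids use slopes of different magnitude.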

In the following we define a much more general type of MF, the left-right MF. 
This type of MF, although extremely flexible in specifying fuzzy sets, is not used 
often in practice because of its unnecessary complexity. 

Definition 2.23 Left-right MF (L-R MF) 

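A left-right MF or L-R MF is specified by three parameters $\{\alpha, \beta, c\}$:

$$\text{LR}(x; c, \alpha, \beta) = \begin{cases} F_L\left(\frac{c - x}{\alpha}\right), & x \le c, \\ F_R\left(\frac{x - c}{\beta}\right), & x \ge c, \end{cases} \tag{2.25}$$

where $F_L(x)$ and $F_R(x)$ are monotonically decreasing functions defined on $[0, \infty)$
with $F_L(0) = F_R(0) = 1$ and $\lim_{x \to \infty} F_L(x) = \lim_{x \to \infty} F_R(x) = 0$.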


□ 


Figure 2.11. Two L-R MFs: (a) LR(x; 65, 60, 10); (b) LR(x; 25, 10, 40).
(MATLAB file: difflr.m)

Example 2.7 L-R MF

Let

$$F_L(x) = \sqrt{\max(0, 1 - x^2)},$$

$$F_R(x) = \exp(-|x|^3).$$

Based on the preceding $F_L(x)$ and $F_R(x)$, Figure 2.11 illustrates two L-R MFs
specified by LR(x; 65, 60, 10) and LR(x; 25, 10, 40).

□ 
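An L-R MF can be sketched by gluing two shape functions together at the center c. The code below assumes the parameter order LR(x; c, α, β) and, purely for illustration, the shape functions F_L(x) = sqrt(max(0, 1 − x²)) and F_R(x) = exp(−|x|³); any pair of decreasing functions on [0, ∞) that start at 1 and tend to 0 could be substituted.

```python
import math


def f_left(x):
    """Assumed left shape function: a quarter circle that reaches 0 at x = 1."""
    return math.sqrt(max(0.0, 1.0 - x * x))


def f_right(x):
    """Assumed right shape function: decays smoothly but never reaches 0."""
    return math.exp(-abs(x) ** 3)


def lr(x, c, alpha, beta):
    """L-R MF with center c, left spread alpha, and right spread beta."""
    if x <= c:
        return f_left((c - x) / alpha)
    return f_right((x - c) / beta)


print(lr(65, 65, 60, 10), lr(35, 65, 60, 10), lr(75, 65, 60, 10))
```

With these shape functions the left flank has bounded support (it is exactly zero once x is more than α below c), whereas the right flank only approaches zero asymptotically.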

The list of MFs introduced in this section is by no means exhaustive; other 
specialized MFs can be created for specific applications if necessary. In particular,
any type of continuous probability distribution function can be used as an MF
here, provided that a set of parameters is given to specify the appropriate meaning
of the MF.

Several other types of parameterized MFs, such as S, Z, π, and two-sided Gaussian
MFs, are examined in Exercises 6, 7, 8, and 10, respectively.

2.4.2 MFs of Two Dimensions 

Sometimes it is advantageous or necessary to use MFs with two inputs, each in 
a different universe of discourse. MFs of this kind are generally referred to as
two-dimensional MFs, whereas ordinary MFs (MFs with one input) are referred 
to as one-dimensional MFs. One natural way to extend one-dimensional MFs to 
two-dimensional ones is via cylindrical extension, defined next. 

Definition 2.24 Cylindrical extensions of one-dimensional fuzzy sets 

If A is a fuzzy set in X, then its cylindrical extension in X × Y is a fuzzy set
c(A) defined by

$$c(A) = \int_{X \times Y} \mu_A(x)/(x, y). \tag{2.26}$$

(Usually A is referred to as a base set.)

□

Figure 2.12. (a) Base set A; (b) its cylindrical extension c(A). (MATLAB file:
cyl_ext.m)


The concept of cylindrical extension is quite straightforward; it is illustrated in 
Figure 2.12. The operation of projection, on the other hand, decreases the dimension 
of a given (multidimensional) membership function. 

Definition 2.25 Projections of fuzzy sets 


Let R be a two-dimensional fuzzy set on X × Y. Then the projections of R onto
X and Y are defined as

$$R_X = \int_X \left[\max_y \mu_R(x, y)\right]/x$$

and

$$R_Y = \int_Y \left[\max_x \mu_R(x, y)\right]/y,$$

respectively.


□ 

Figure 2.13(a) shows the MF for fuzzy set R; Figure 2.13(b) and Figure 2.13(c)
are the projections of R onto X and Y, respectively.
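On discrete universes, cylindrical extension copies each grade along the new dimension, and projection takes the maximum over the dimension being removed. The sketch below uses small made-up sets; the grades, labels, and function names are assumptions of this illustration.

```python
Y = ["a", "b"]                  # second universe
A = {0: 0.3, 1: 1.0, 2: 0.6}    # base fuzzy set on X = {0, 1, 2}

# A hypothetical two-dimensional fuzzy set R on X x Y.
R = {
    (0, "a"): 0.1, (0, "b"): 0.4,
    (1, "a"): 0.9, (1, "b"): 0.5,
    (2, "a"): 0.2, (2, "b"): 0.7,
}


def cylindrical_extension(a, ys):
    """Extend a one-dimensional fuzzy set to X x Y; the grade ignores y."""
    return {(x, y): mu for x, mu in a.items() for y in ys}


def project(r, axis):
    """Project a 2-D fuzzy set onto X (axis=0) or Y (axis=1) using max."""
    out = {}
    for pair, mu in r.items():
        key = pair[axis]
        out[key] = max(out.get(key, 0.0), mu)
    return out


print(project(R, 0))
print(project(R, 1))
print(project(cylindrical_extension(A, Y), 0) == A)
```

Projecting a cylindrical extension back onto its base universe recovers the base set, whereas extending a projection generally does not recover the original two-dimensional set.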

Generally speaking, MFs of two dimensions fall into two categories: composite 
and noncomposite. If an MF of two dimensions can be expressed as an analytic 
expression of two MFs of one dimension, then it is composite; otherwise it is non- 
composite. An example is given next. 

Figure 2.13. (a) Two-dimensional fuzzy set R; (b) $R_X$ (projection of R onto X);
and (c) $R_Y$ (projection of R onto Y). (MATLAB file: project.m)

Example 2.8 Composite and noncomposite MFs

Suppose that fuzzy set A = “(x, y) is near (3, 4)” is defined by

$$\mu_A(x, y) = \exp\left[-\frac{1}{2}\left(\frac{x - 3}{2}\right)^2 - \frac{1}{2}(y - 4)^2\right].$$

Then this two-dimensional MF is composite, since it can be decomposed into two
Gaussian MFs:

$$\mu_A(x, y) = \exp\left[-\frac{1}{2}\left(\frac{x - 3}{2}\right)^2\right] \exp\left[-\frac{1}{2}\left(\frac{y - 4}{1}\right)^2\right] = \text{gaussian}(x; 3, 2)\,\text{gaussian}(y; 4, 1).$$

Note that we can view the fuzzy set A as two statements joined by the connective
AND: “x is near 3 AND y is near 4,” where the first statement is defined by

$$\mu_{\text{near 3}}(x) = \text{gaussian}(x; 3, 2),$$

and the second statement is defined by

$$\mu_{\text{near 4}}(y) = \text{gaussian}(y; 4, 1).$$

Thus the multiplication of these two MFs is used to interpret the AND operation
of these two statements.

On the other hand, if this fuzzy set is defined by

$$\mu_A(x, y) = \frac{1}{1 + |x - 3|\,|y - 4|^{2.5}}, \tag{2.27}$$

then it is noncomposite.


□ 

As demonstrated in the preceding example, a composite two-dimensional MF is
usually the result of two statements joined by the AND or OR connectives. Under
this condition, the two-dimensional MF is defined as the AND or OR aggregation
of its two constituent MFs. Classical AND and OR operations on fuzzy sets are
min and max (see Section 2.3); their effects on generating two-dimensional MFs are
illustrated in the following example.

Figure 2.14. Two-dimensional MFs defined by the min and max operators: (a) z
= min(trap(x), trap(y)); (b) z = max(trap(x), trap(y)); (c) z = min(bell(x), bell(y));
(d) z = max(bell(x), bell(y)). (MATLAB file: mf2d.m)

Example 2.9 Composite two-dimensional MFs based on min and max operators 

Let trap(x) = trapezoid(x; −6, −2, 2, 6) and trap(y) = trapezoid(y; −6, −2, 2, 6) be
two trapezoidal MFs on X and Y, respectively. After applying the min and max
operators, we have two-dimensional MFs on X × Y, as shown in Figures 2.14(a)
and (b). Figures 2.14(c) and (d) repeat the same plots, except that the trapezoidal
MFs are replaced by bell MFs bell(x) = bell(x; 4, 3, 0) and bell(y) = bell(y; 4, 3, 0).

□ 
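The surfaces of Example 2.9 can be evaluated point by point. The sketch below assumes the usual closed forms for the trapezoidal and generalized bell MFs (the trapezoid as a clipped pair of ramps, the bell as 1/(1 + |(x − c)/a|^(2b))); the helper names are illustrative, and this is not the book's mf2d.m.

```python
# Sketch: composite two-dimensional MFs built with min (AND) and max (OR).
# The MF formulas are the commonly used closed forms and are assumed here.

def trapezoid(x, a, b, c, d):
    """Trapezoidal MF with feet at a and d and shoulders at b and c."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def bell(x, a, b, c):
    """Generalized bell MF with width a, slope parameter b, and center c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def trap(v):
    return trapezoid(v, -6, -2, 2, 6)

def gbell(v):
    return bell(v, 4, 3, 0)

def and_min(mf_x, mf_y, x, y):
    """z = min(mf_x(x), mf_y(y)), as in Figures 2.14(a) and (c)."""
    return min(mf_x(x), mf_y(y))

def or_max(mf_x, mf_y, x, y):
    """z = max(mf_x(x), mf_y(y)), as in Figures 2.14(b) and (d)."""
    return max(mf_x(x), mf_y(y))

# One grid point: x = 0 lies on the plateau, y = 4 lies halfway down the slope.
z_and = and_min(trap, trap, 0.0, 4.0)  # 0.5
z_or = or_max(trap, trap, 0.0, 4.0)    # 1.0
```

Evaluating `and_min` and `or_max` over a grid of (x, y) pairs gives the data for surface plots of the kind shown in Figure 2.14.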

When the min operator is used to aggregate one-dimensional MFs, the resulting 
two-dimensional MF can be viewed either as the result of applying classical fuzzy 
intersection (Definition 2.15) to the cylindrical extensions of each one-dimensional 
MF, or as a Cartesian product of two one-dimensional fuzzy sets (Definition 2.17). 
Similar interpretations apply to the max operator. 
It is obvious that the definitions for one-dimensional MFs introduced in Section 2.2
have natural extensions to the case of two-dimensional MFs, with some
appropriate adjustments. (For instance, crossover points should be changed to
crossover curves, an α-cut is a crisp set in $R^2$ instead of R, and so on.) Moreover, the
concepts introduced in this section can be generalized easily to form the concepts
of n-dimensional MFs.


2.4.3 Derivatives of Parameterized MFs 


To make a fuzzy system adaptive, we need to know the derivatives of an MF with 
respect to its argument (input) and parameters. This derivative information plays a 
central role in the learning or adaptation of a fuzzy system, which will be discussed 
in depth in subsequent chapters. Here we list these derivatives for the Gaussian 
and bell MFs; the reader is encouraged to derive them independently. 

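For the Gaussian MF, let

$$y = \text{gaussian}(x; c, \sigma) = e^{-\frac{1}{2}\left(\frac{x - c}{\sigma}\right)^2}. \tag{2.28}$$

Then

$$\frac{\partial y}{\partial x} = -\frac{x - c}{\sigma^2}\, y, \tag{2.29}$$

$$\frac{\partial y}{\partial \sigma} = \frac{(x - c)^2}{\sigma^3}\, y, \tag{2.30}$$

$$\frac{\partial y}{\partial c} = \frac{x - c}{\sigma^2}\, y. \tag{2.31}$$

In the preceding expressions, the derivatives are arranged to include y and thus
save computation. For the bell MF, let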


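$$y = \text{bell}(x; a, b, c) = \frac{1}{1 + \left|\frac{x - c}{a}\right|^{2b}}. \tag{2.32}$$

Then

$$\frac{\partial y}{\partial x} = \begin{cases} -\dfrac{2b}{x - c}\, y(1 - y), & \text{if } x \ne c. \\ 0, & \text{if } x = c. \end{cases} \tag{2.33}$$

$$\frac{\partial y}{\partial a} = \frac{2b}{a}\, y(1 - y). \tag{2.34}$$

$$\frac{\partial y}{\partial b} = \begin{cases} -2 \ln\left|\dfrac{x - c}{a}\right|\, y(1 - y), & \text{if } x \ne c. \\ 0, & \text{if } x = c. \end{cases} \tag{2.35}$$

$$\frac{\partial y}{\partial c} = \begin{cases} \dfrac{2b}{x - c}\, y(1 - y), & \text{if } x \ne c. \\ 0, & \text{if } x = c. \end{cases} \tag{2.36}$$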


The derivation of these formulas is left as Exercises 12 and 13.
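One way to gain confidence in Equations (2.29) to (2.36) before deriving them is to compare each closed form with a central finite difference. The sketch below does this in plain Python; the function names, the dictionary layout, and the step size are illustrative choices, and the Gaussian is assumed to take its parameters in the order (c, σ).

```python
import math

def gaussian(x, c, sigma):
    """Gaussian MF exp(-0.5 * ((x - c) / sigma) ** 2)."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)

def gaussian_grads(x, c, sigma):
    """Partial derivatives of the Gaussian MF, each written in terms of y."""
    y = gaussian(x, c, sigma)
    return {
        "x": -(x - c) / sigma ** 2 * y,
        "sigma": (x - c) ** 2 / sigma ** 3 * y,
        "c": (x - c) / sigma ** 2 * y,
    }

def bell(x, a, b, c):
    """Generalized bell MF 1 / (1 + |(x - c) / a| ** (2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def bell_grads(x, a, b, c):
    """Partial derivatives of the bell MF, each written in terms of y(1 - y)."""
    if x == c:
        # y = 1 at the center, so y(1 - y) = 0 and every derivative vanishes.
        return {"x": 0.0, "a": 0.0, "b": 0.0, "c": 0.0}
    y = bell(x, a, b, c)
    g = y * (1.0 - y)
    return {
        "x": -2.0 * b / (x - c) * g,
        "a": 2.0 * b / a * g,
        "b": -2.0 * math.log(abs((x - c) / a)) * g,
        "c": 2.0 * b / (x - c) * g,
    }

def central_diff(f, v, h=1e-6):
    """Second-order finite-difference estimate of df/dv at v."""
    return (f(v + h) - f(v - h)) / (2.0 * h)

# Compare the analytic and numerical derivative of the bell MF with respect to b.
analytic = bell_grads(4.0, 2.0, 1.5, 1.0)["b"]
numeric = central_diff(lambda b: bell(4.0, 2.0, b, 1.0), 1.5)
```

Writing the derivatives in terms of y (or y(1 − y)) means the MF value computed in the forward pass can be reused, which is the saving the text refers to.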
2.5 MORE ON FUZZY UNION, INTERSECTION, AND COMPLEMENT*

This section discusses more advanced topics concerning complement, union, and 
intersection operations on fuzzy sets. For a first-time reading, this section may 
be omitted without discontinuity. Throughout this book, sections or subsections 
marked with a star (*) can be skipped at first reading. 

Although the classical fuzzy set operators [Equations (2.13), (2.14), and (2.15)]
possess more rigorous axiomatic properties (as shown in this section), they are not 
the only ways to define reasonable and consistent operations on fuzzy sets. This 
section examines other viable definitions of the fuzzy complement, intersection, and 
union operators. 

2.5.1 Fuzzy Complement* 

A fuzzy complement operator is a continuous function $N : [0, 1] \to [0, 1]$ which
meets the following axiomatic requirements:

$$\begin{array}{ll} N(0) = 1 \text{ and } N(1) = 0 & \text{(boundary)} \\ N(a) \ge N(b) \text{ if } a \le b & \text{(monotonicity).} \end{array} \tag{2.37}$$

All functions satisfying these requirements form the general class of fuzzy comple- 
ments. It is evident that violation of any of these requirements would add to this 
class some functions which are unacceptable as complement operators. Specifically, 
a violation of the boundary conditions would include functions that do not conform 
to the ordinary complement for crisp sets. The monotonic decreasing requirement 
is essential since we intuitively expect that an increase in the membership grade of 
a fuzzy set must result in a decrease in the membership grade of its complement. 
These two requirements are the basic requirements that a fuzzy complement op- 
erator should meet. Another optional requirement imposes involution on a fuzzy 
complement: 

N(N(a)) = a (involution), (2.38) 

which guarantees that the double complement of a fuzzy set is still the set itself. 
The following examples of fuzzy complements satisfy the two basic requirements
in Equation (2.37) as well as the optional involution requirement.

Example 2.10 Sugeno’s complement 

One class of fuzzy complements is Sugeno’s complement [8], defined by 

$$N_s(a) = \frac{1 - a}{1 + sa}, \tag{2.39}$$

where s is a parameter greater than —1. For each value of the parameter s, we 
obtain a particular fuzzy complement operator, as shown in Figure 2.15(a). 

□ 
Figure 2.15. (a) Sugeno’s complements; (b) Yager’s complements. (MATLAB file: negation.m)


Example 2.11 Yager’s complement 

Another class of fuzzy complements is Yager’s complement [10] defined by 

$$N_w(a) = (1 - a^w)^{1/w}, \tag{2.40}$$

where w is a positive parameter. Figure 2.15(b) demonstrates this class of func- 
tions for various values of w. Note that due to the involution requirement, both 
Sugeno’s and Yager’s complements are symmetric about the 45-degree straight line 
connecting (0, 0) and (1, 1). 

□ 
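Both parameterized complements are one-line functions, and the involution property of Equation (2.38) can be spot-checked numerically. The sketch below is illustrative only; the function names, the grid, and the parameter values tried are assumptions made for the example.

```python
def sugeno_complement(a, s):
    """Sugeno's complement N_s(a) = (1 - a) / (1 + s a), with s > -1."""
    return (1.0 - a) / (1.0 + s * a)

def yager_complement(a, w):
    """Yager's complement N_w(a) = (1 - a ** w) ** (1 / w), with w > 0."""
    return (1.0 - a ** w) ** (1.0 / w)

def involution_error(complement, param, steps=20):
    """Largest |N(N(a)) - a| over an evenly spaced grid on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(abs(complement(complement(a, param), param) - a) for a in grid)

# s = 0 and w = 1 both reduce to the classical complement 1 - a.
classical_from_sugeno = sugeno_complement(0.3, 0.0)  # 0.7
classical_from_yager = yager_complement(0.3, 1.0)    # 0.7
```

A numerical check of this kind does not replace the proof requested in Exercise 17, but it is a quick way to catch a mistyped formula.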

Obviously, these axiomatic requirements for fuzzy complements do not determine
N(·) uniquely. However, N(a) is equal to 1 − a (the classical fuzzy complement)
if the following requirement is introduced [1]:

$$\mu_A(x_1) - \mu_A(x_2) = \mu_{\bar{A}}(x_2) - \mu_{\bar{A}}(x_1). \tag{2.41}$$

This requirement ensures that a change in the membership value in A should have
a corresponding effect on the membership in $\bar{A}$. This requirement together with
the basic requirements (boundary and monotonicity) for fuzzy complements entails
N(a) = 1 − a, which automatically satisfies the involution requirement.

2.5.2 Fuzzy Intersection and Union* 

The intersection of two fuzzy sets A and B is specified in general by a function
$T : [0, 1] \times [0, 1] \to [0, 1]$, which aggregates two membership grades as follows:

$$\mu_{A \cap B}(x) = T(\mu_A(x), \mu_B(x)) = \mu_A(x) \,\tilde{*}\, \mu_B(x), \tag{2.42}$$

where $\tilde{*}$ is a binary operator for the function T. This class of fuzzy intersection
operators, which are usually referred to as T-norm (triangular norm) operators,
meets the following basic requirements.
Definition 2.26 T-norm

A T-norm operator [3] is a two-place function T(·, ·) satisfying

$$\begin{array}{ll} T(0, 0) = 0,\; T(a, 1) = T(1, a) = a & \text{(boundary)} \\ T(a, b) \le T(c, d) \text{ if } a \le c \text{ and } b \le d & \text{(monotonicity)} \\ T(a, b) = T(b, a) & \text{(commutativity)} \\ T(a, T(b, c)) = T(T(a, b), c) & \text{(associativity).} \end{array} \tag{2.43}$$

□ 


The first requirement imposes the correct generalization to crisp sets. The second
requirement implies that a decrease in the membership values in A or B cannot
produce an increase in the membership value in A ∩ B. The third requirement
indicates that the operator is indifferent to the order of the fuzzy sets to be combined.
Finally, the fourth requirement allows us to take the intersection of any number of 
sets in any order of pairwise groupings. The following example illustrates four of 
the most frequently encountered T-norm operators. 

Example 2.12 Four T-norm operators 


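Four of the most frequently used T-norm operators are

$$\begin{array}{ll} \text{Minimum:} & T_{min}(a, b) = \min(a, b) = a \wedge b. \\ \text{Algebraic product:} & T_{ap}(a, b) = ab. \\ \text{Bounded product:} & T_{bp}(a, b) = 0 \vee (a + b - 1). \\ \text{Drastic product:} & T_{dp}(a, b) = \begin{cases} a, & \text{if } b = 1. \\ b, & \text{if } a = 1. \\ 0, & \text{if } a, b < 1. \end{cases} \end{array} \tag{2.44}$$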


With the understanding that a and b are between 0 and 1, we can draw surface
plots of these four T-norm operators as functions of a and b; see the first row of
Figure 2.16. The second row of Figure 2.16 shows the corresponding surfaces when
a = $\mu_A(x)$ = trapezoid(x; 3, 8, 12, 17) and b = $\mu_B(y)$ = trapezoid(y; 3, 8, 12, 17);
these two-dimensional MFs can be viewed as the Cartesian product of A and B
under four different T-norm operators.

From Figure 2.16, it can be observed that

$$T_{dp}(a, b) \le T_{bp}(a, b) \le T_{ap}(a, b) \le T_{min}(a, b). \tag{2.45}$$

This can be verified mathematically.

□ 
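The four operators of Example 2.12 and the ordering in Equation (2.45) can be checked on a grid. This is a sketch with illustrative function names; a small tolerance is used because floating-point rounding can make a + b − 1 exceed ab by about one unit in the last place when a or b equals 1.

```python
def t_min(a, b):
    """Minimum T-norm."""
    return min(a, b)

def t_ap(a, b):
    """Algebraic product."""
    return a * b

def t_bp(a, b):
    """Bounded product 0 v (a + b - 1)."""
    return max(0.0, a + b - 1.0)

def t_dp(a, b):
    """Drastic product."""
    if b == 1.0:
        return a
    if a == 1.0:
        return b
    return 0.0

def ordering_holds(steps=10, tol=1e-12):
    """Check T_dp <= T_bp <= T_ap <= T_min on an evenly spaced grid."""
    grid = [i / steps for i in range(steps + 1)]
    for a in grid:
        for b in grid:
            values = [t_dp(a, b), t_bp(a, b), t_ap(a, b), t_min(a, b)]
            if any(lo > hi + tol for lo, hi in zip(values, values[1:])):
                return False
    return True

sample = [t_dp(0.7, 0.6), t_bp(0.7, 0.6), t_ap(0.7, 0.6), t_min(0.7, 0.6)]
# approximately [0.0, 0.3, 0.42, 0.6]
```

The grid check is only a sanity test; the inequalities themselves are the subject of Exercise 19.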


Like fuzzy intersection, the fuzzy union operator is specified in general by a
function $S : [0, 1] \times [0, 1] \to [0, 1]$. In symbols,

$$\mu_{A \cup B}(x) = S(\mu_A(x), \mu_B(x)) = \mu_A(x) \,\tilde{+}\, \mu_B(x), \tag{2.46}$$

where $\tilde{+}$ is a binary operator for the function S. This class of fuzzy union operators,
which are often referred to as T-conorm (or S-norm) operators, satisfies the following
basic requirements.

Figure 2.16. (First row) Four T-norm operators $T_{min}(a, b)$, $T_{ap}(a, b)$,
$T_{bp}(a, b)$, and $T_{dp}(a, b)$; (second row) the corresponding surfaces for a =
trapezoid(x; 3, 8, 12, 17) and b = trapezoid(y; 3, 8, 12, 17). (MATLAB file:
tnorm.m)


Definition 2.27 T-conorm (S-norm) 


A T-conorm (or S-norm) operator [3] is a two-place function S(·, ·) satisfying

$$\begin{array}{ll} S(1, 1) = 1,\; S(0, a) = S(a, 0) = a & \text{(boundary)} \\ S(a, b) \le S(c, d) \text{ if } a \le c \text{ and } b \le d & \text{(monotonicity)} \\ S(a, b) = S(b, a) & \text{(commutativity)} \\ S(a, S(b, c)) = S(S(a, b), c) & \text{(associativity).} \end{array} \tag{2.47}$$

□ 


The justification of these basic requirements is similar to that of the requirements 
for T-norm operators. 


Figure 2.17. (First row) Four T-conorm operators $S_{max}(a, b)$, $S_{ap}(a, b)$,
$S_{bp}(a, b)$, and $S_{dp}(a, b)$; (second row) the corresponding surfaces for a =
trapezoid(x; 3, 8, 12, 17) and b = trapezoid(y; 3, 8, 12, 17). (MATLAB file:
tconorm.m)

Example 2.13 Four T-conorm operators


Corresponding to the four T-norm operators in the previous example, we have the 
following four T-conorm operators. 


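$$\begin{array}{ll} \text{Maximum:} & S(a, b) = \max(a, b) = a \vee b. \\ \text{Algebraic sum:} & S(a, b) = a + b - ab. \\ \text{Bounded sum:} & S(a, b) = 1 \wedge (a + b). \\ \text{Drastic sum:} & S(a, b) = \begin{cases} a, & \text{if } b = 0. \\ b, & \text{if } a = 0. \\ 1, & \text{if } a, b > 0. \end{cases} \end{array} \tag{2.48}$$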


The first row of Figure 2.17 shows the surface plots of these T-conorm operators.
The second row demonstrates the corresponding two-dimensional MFs when a =
$\mu_A(x)$ = trapezoid(x; 3, 8, 12, 17) and b = $\mu_B(y)$ = trapezoid(y; 3, 8, 12, 17); these
MFs are the Cartesian coproduct of A and B using these four T-conorm operators.
It can also be verified that

$$S_{max}(a, b) \le S_{ap}(a, b) \le S_{bp}(a, b) \le S_{dp}(a, b), \tag{2.49}$$

where the subscripts max, ap, bp, and dp denote the maximum, algebraic sum,
bounded sum, and drastic sum, respectively.

□ 

Note that these essential requirements for T-norm and T-conorm operators can- 
not uniquely determine the classical fuzzy intersection and union — namely, the min 
and max operators. Stronger restrictions have to be taken into consideration to 
pinpoint the min and max operators. For a detailed treatment of this subject, 
see [1]. 
Theorem 2.1 Generalized DeMorgan’s law

T-norms T(·, ·) and T-conorms S(·, ·) are duals which support the generalization of
DeMorgan’s law:

$$T(a, b) = N(S(N(a), N(b))), \qquad S(a, b) = N(T(N(a), N(b))), \tag{2.50}$$

where N(·) is a fuzzy complement operator. If we use $\tilde{*}$ and $\tilde{+}$ for T-norm and
T-conorm operators, respectively, then the preceding equations can be rewritten as

$$a \,\tilde{*}\, b = N(N(a) \,\tilde{+}\, N(b)), \qquad a \,\tilde{+}\, b = N(N(a) \,\tilde{*}\, N(b)). \tag{2.51}$$

□ 


Thus for a given T-norm operator, we can always find a corresponding T-conorm 
operator through the generalized DeMorgan’s law, and vice versa. (In fact, the four 
T-norm and T-conorm operators in Examples 2.12 and 2.13, respectively, are dual 
in the sense of the generalized DeMorgan’s law. The reader is encouraged to verify 
this.) 
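The duality of the four operator pairs can be spot-checked numerically before it is proved. The sketch below assumes the classical complement N(a) = 1 − a, which is one admissible choice among the complements the theorem allows; all function names are illustrative.

```python
def complement(a):
    """Classical complement N(a) = 1 - a (assumed here; other N are possible)."""
    return 1.0 - a

def t_min(a, b): return min(a, b)
def t_ap(a, b): return a * b
def t_bp(a, b): return max(0.0, a + b - 1.0)
def t_dp(a, b):
    if b == 1.0:
        return a
    if a == 1.0:
        return b
    return 0.0

def s_max(a, b): return max(a, b)
def s_as(a, b): return a + b - a * b
def s_bs(a, b): return min(1.0, a + b)
def s_ds(a, b):
    if b == 0.0:
        return a
    if a == 0.0:
        return b
    return 1.0

def dual(op, n=complement):
    """Return the operator (a, b) -> N(op(N(a), N(b)))."""
    return lambda a, b: n(op(n(a), n(b)))

PAIRS = [(t_min, s_max), (t_ap, s_as), (t_bp, s_bs), (t_dp, s_ds)]

def max_duality_gap(steps=20):
    """Largest deviation from the generalized DeMorgan identities on a grid."""
    grid = [i / steps for i in range(steps + 1)]
    gap = 0.0
    for t, s in PAIRS:
        t_from_s, s_from_t = dual(s), dual(t)
        for a in grid:
            for b in grid:
                gap = max(gap,
                          abs(t(a, b) - t_from_s(a, b)),
                          abs(s(a, b) - s_from_t(a, b)))
    return gap
```

Because `dual` takes the complement as an argument, the same helper can be reused with Sugeno's or Yager's complement to construct other T-conorms from a given T-norm.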


2.5.3 Parameterized T-norm and T-conorm* 

Several parameterized T-norms and dual T-conorms have been proposed in the 
past, such as those of Yager [9], Dubois and Prade [4], Schweizer and Sklar [7], and 
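Sugeno [8]. For instance, Schweizer and Sklar’s T-norm and T-conorm operators can be
expressed as

$$\begin{array}{l} T_{SS}(a, b, p) = \left[\max\{0, (a^{-p} + b^{-p} - 1)\}\right]^{-\frac{1}{p}}, \\ S_{SS}(a, b, p) = 1 - \left[\max\{0, ((1 - a)^{-p} + (1 - b)^{-p} - 1)\}\right]^{-\frac{1}{p}}. \end{array} \tag{2.52}$$

It is observed that

$$\lim_{p \to 0} T_{SS}(a, b, p) = ab, \qquad \lim_{p \to \infty} T_{SS}(a, b, p) = \min(a, b), \tag{2.53}$$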


which correspond to two of the more commonly used T-norms for the fuzzy AND 
operation. 
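The limiting behavior in Equation (2.53) can be observed numerically by evaluating the Schweizer and Sklar T-norm for very small and very large p. The sketch below restricts a and b to (0, 1] so that the negative powers are defined; the function names and the sample values are illustrative.

```python
def t_ss(a, b, p):
    """Schweizer and Sklar T-norm for 0 < a, b <= 1 and p != 0."""
    inner = max(0.0, a ** (-p) + b ** (-p) - 1.0)
    return inner ** (-1.0 / p)

def s_ss(a, b, p):
    """Dual T-conorm, 1 - T_SS(1 - a, 1 - b, p), for 0 <= a, b < 1."""
    return 1.0 - t_ss(1.0 - a, 1.0 - b, p)

a, b = 0.3, 0.6
near_product = t_ss(a, b, 1e-6)   # close to a * b = 0.18
near_minimum = t_ss(a, b, 200.0)  # close to min(a, b) = 0.3
bounded = t_ss(0.7, 0.6, -1.0)    # p = -1 gives max(0, a + b - 1), about 0.3
```

With p = −1 the operator coincides with the bounded product of Example 2.12, so a single parameter sweeps through several of the T-norms seen earlier.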

To give a general idea of how the parameter p affects the T-norm and T-conorm
operators, Figure 2.18(a) shows typical membership functions of fuzzy sets A and
B; Figure 2.18(b) and Figure 2.18(c) are $T_{SS}(a, b, p)$ and $S_{SS}(a, b, p)$, respectively,
with p = ∞ (solid line), 1 (dashed line), 0 (dotted line) and −1 (dash-dotted line).
Note that the bell-shaped membership functions of A and B in Figure 2.18(a) are
defined as follows:

$$\mu_A(x) = \text{bell}(x; 5, 2, 7.5) = \frac{1}{1 + \left(\frac{x - 7.5}{5}\right)^4}, \tag{2.54}$$

$$\mu_B(x) = \text{bell}(x; 5, 1, 5) = \frac{1}{1 + \left(\frac{x - 5}{5}\right)^2}. \tag{2.55}$$


For completeness, other types of parameterized T-norms are given next. 
Figure 2.18. Schweizer and Sklar’s parameterized T-norms and T-conorms: (a)
membership functions for fuzzy sets A and B; (b) $T_{SS}(a, b, p)$ and (c) $S_{SS}(a, b, p)$
with p = ∞ (solid line), 1 (dashed line), 0 (dotted line) and −1 (dash-dotted line).
(MATLAB file: sstnorm.m)


Yager [9]: For q > 0,

$$\begin{cases} T_Y(a, b, q) = 1 - \min\{1, [(1 - a)^q + (1 - b)^q]^{1/q}\}, \\ S_Y(a, b, q) = \min\{1, (a^q + b^q)^{1/q}\}. \end{cases} \tag{2.56}$$

Dubois and Prade [4]: For α ∈ [0, 1],

$$\begin{cases} T_{DP}(a, b, \alpha) = ab / \max\{a, b, \alpha\}, \\ S_{DP}(a, b, \alpha) = [a + b - ab - \min\{a, b, (1 - \alpha)\}] / \max\{1 - a, 1 - b, \alpha\}. \end{cases} \tag{2.57}$$

Hamacher [6]: For γ > 0,

$$\begin{cases} T_H(a, b, \gamma) = ab / [\gamma + (1 - \gamma)(a + b - ab)], \\ S_H(a, b, \gamma) = [a + b + (\gamma - 2)ab] / [1 + (\gamma - 1)ab]. \end{cases} \tag{2.58}$$

Frank [5]: For s > 0,

$$\begin{cases} T_F(a, b, s) = \log_s[1 + (s^a - 1)(s^b - 1)/(s - 1)], \\ S_F(a, b, s) = 1 - \log_s[1 + (s^{1-a} - 1)(s^{1-b} - 1)/(s - 1)]. \end{cases} \tag{2.59}$$

Sugeno [8]: For λ ≥ −1,

$$\begin{cases} T_S(a, b, \lambda) = \max\{0, (\lambda + 1)(a + b - 1) - \lambda ab\}, \\ S_S(a, b, \lambda) = \min\{1, a + b - \lambda ab\}. \end{cases} \tag{2.60}$$

Dombi [2]: For λ > 0,

$$\begin{cases} T_D(a, b, \lambda) = \dfrac{1}{1 + [(a^{-1} - 1)^{\lambda} + (b^{-1} - 1)^{\lambda}]^{1/\lambda}}, \\ S_D(a, b, \lambda) = \dfrac{1}{1 + [(a^{-1} - 1)^{-\lambda} + (b^{-1} - 1)^{-\lambda}]^{-1/\lambda}}. \end{cases} \tag{2.61}$$


2.6 SUMMARY


This chapter introduces the basic definitions, notation, and operations for fuzzy 
sets, including their membership function representations, set-theoretic operations
(AND, OR, and NOT), various types of membership functions, and advanced fuzzy 
set operators such as T-norms and T-conorms. 

Most membership functions are determined by domain experts. The human- 
determined membership functions, however, may not be precise enough for certain 
applications. Therefore, it is always advisable to apply optimization techniques 
to fine-tune parameterized membership functions for better performance. In the 
discussion of neuro-fuzzy modeling in the subsequent chapters, we shall come across 
parameterized membership functions again and use their derivatives for derivative- 
based optimization. 

Fuzzy sets lay the foundation for the entire fuzzy set theory and related disci- 
plines. In the next chapter, we shall introduce the use of fuzzy sets in fuzzy if-then 
rules and fuzzy reasoning. 


EXERCISES 

1. Sometimes it is useful to decompose an MF into a combination of its α-level
sets’ MFs; this is the resolution principle, which states

$$\mu_A(x) = \max_{\alpha} \min[\alpha, \mu_{A_\alpha}(x)], \tag{2.62}$$

where $A_\alpha$ is the α-cut (level set) of fuzzy set A and $\mu_{A_\alpha}(x)$ is the MF of $A_\alpha$.
[Remember that $A_\alpha$ is a crisp set and thus $\mu_{A_\alpha}(x)$ can take only values in
{0, 1}.] Figure 2.19 illustrates the concept of the resolution principle, where
each rectangle under the MF represents $\min[\alpha, \mu_{A_\alpha}(x)]$ for a specific α between
0 and 1. (a) Adapt Equation (2.62) to a fuzzy set with a discrete universe of
discourse. (b) Put the MF of fuzzy set A in Example 2.2 into the resolution
format.

Figure 2.19. Resolution principle. (MATLAB file: resolut.m)

2. Verify the identities in Table 2.1 using Venn diagrams. 

3. Determine whether each identity in Table 2.1 holds for the classical fuzzy
operators [Equations (2.13), (2.14), and (2.15)]. Explain why by giving simple
proofs or counterexamples.

4. Repeat Exercise 3, assuming that the fuzzy union and intersection are defined 
differently: 


$$\mu_{A \cup B}(x) = \mu_A(x) + \mu_B(x) - \mu_A(x)\mu_B(x), \tag{2.63}$$

$$\mu_{A \cap B}(x) = \mu_A(x)\mu_B(x). \tag{2.64}$$


5. Suppose that fuzzy set A is described by $\mu_A(x)$ = bell(x; a, b, c). Show that the
classical fuzzy complement of A is described by $\mu_{\bar{A}}(x)$ = bell(x; a, −b, c).

6. The S-MF with two parameters l and r (l < r) is an S-shaped open-right MF
defined by

$$S(x; l, r) = \begin{cases} 0, & \text{for } x \le l. \\ 2\left(\dfrac{x - l}{r - l}\right)^2, & \text{for } l < x \le \dfrac{l + r}{2}. \\ 1 - 2\left(\dfrac{r - x}{r - l}\right)^2, & \text{for } \dfrac{l + r}{2} < x \le r. \\ 1, & \text{for } r < x. \end{cases} \tag{2.65}$$

(a) Write a MATLAB function to implement this MF. (b) Plot instances of this
MF with various values of parameters. (c) Find the crossover point of S(x; l, r).
(d) Prove that the derivative of S(x; l, r) with respect to x is continuous.

7. The Z-MF with two parameters l and r (l < r) is a Z-shaped open-left MF
defined by

$$Z(x; l, r) = 1 - S(x; l, r), \tag{2.66}$$

where S(x; l, r) is the S-MF in the previous exercise. Repeat (a) through (d)
of Exercise 6 with the Z-MF.


8. The π-MF with two parameters a and c is a π-shaped MF defined via the S-
and Z-MFs introduced earlier, as follows:

$$\pi(x; a, c) = \begin{cases} S(x; c - a, c), & \text{for } x \le c \\ Z(x; c, c + a), & \text{for } x > c, \end{cases} \tag{2.67}$$

where c is the center and a (> 0) is the spread on each side of the MF. (a)
Write a MATLAB function to implement this MF. (b) Plot instances of this
MF with various values of parameters. (c) Find the crossover points and width
of π(x; a, c).


9. The two-sided π-MF is an extension of the π-MF introduced previously; it
is defined with four parameters a, b, c, and d:

$$\text{ts\_}\pi(x; a, b, c, d) = \begin{cases} 0, & \text{for } x \le a. \\ S(x; a, b), & \text{for } a < x \le b. \\ 1, & \text{for } b < x \le c. \\ Z(x; c, d), & \text{for } c < x \le d. \\ 0, & \text{for } d < x. \end{cases} \tag{2.68}$$

(a) Write a MATLAB function to implement this MF. (b) Plot instances of this
MF with various values of parameters. (c) Find the crossover points and width
of ts_π(x; a, b, c, d).

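10. The two-sided Gaussian MF is defined by

$$\text{ts\_gaussian}(x; c_1, \sigma_1, c_2, \sigma_2) = \begin{cases} \exp\left[-\frac{1}{2}\left(\frac{x - c_1}{\sigma_1}\right)^2\right], & \text{for } x \le c_1. \\ 1, & \text{for } c_1 < x \le c_2. \\ \exp\left[-\frac{1}{2}\left(\frac{x - c_2}{\sigma_2}\right)^2\right], & \text{for } c_2 < x. \end{cases} \tag{2.69}$$

(a) Write a MATLAB function to implement this MF. (b) Plot instances of this
MF with various values of parameters. (c) Find the crossover points and width
of this MF.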
11. Find the level set $A_\alpha$ and its width for a fuzzy set A defined by $\mu_A(x)$ =
trapezoid(x; a, b, c, d).

12. Derive the partial derivatives of a Gaussian MF y = gaussian(x; c, σ) with
respect to its argument x and parameters σ and c, and thus verify Equations (2.29)
to (2.31).

13. Derive the partial derivatives of a bell MF y = bell(x; a, b, c) with respect to
its argument x and parameters a, b, and c, and thus verify Equations (2.33) to
(2.36).

14. Let the fuzzy set A be defined by a Gaussian MF gaussian(x; c, σ). Show that
width($A_{0.99}$)/width($A_{0.01}$) is a constant independent of the parameters c and
σ.

15. Let the fuzzy set A be defined by a generalized bell MF bell(x; a, b, c). Show
that width($A_{0.99}$)/width($A_{0.01}$) is a function of parameter b only.

16. Let the fuzzy set A be defined by a bell MF bell(x; a, b, c). Demonstrate that
for all α ∈ (0, 1),

$$\lim_{b \to \infty} \text{width}(A_\alpha) = 2a.$$

This demonstrates that a bell MF will approach a characteristic function of a
crisp set if b → ∞.

17. Show that Sugeno’s and Yager’s complement operators (Examples 2.10 and
2.11) satisfy the involution requirement of Equation (2.38).

18. Verify that the four T-norm and T-conorm operators in Examples 2.12 and
2.13 are dual to each other in the sense of the generalized DeMorgan’s law.

19. Prove the inequalities of Equations (2.45) and (2.49).

20. Show that (a) $T_{SS}(a, b, p)$ = ab when p → 0; (b) $T_{SS}(a, b, p)$ = min(a, b) when
p → ∞.

21. Show that the following operators on fuzzy sets satisfy DeMorgan’s law: (a)
Dombi’s T-norm and T-conorm, with N(a) = 1 − a; (b) Hamacher’s T-norm
and T-conorm, with N(a) = 1 − a; (c) max and min, with N(a) as Sugeno’s
complement.

22. Show that the two-dimensional MF defined by


VA(x,y) 


i+ 



b 


is a composite MF based on the one-dimensional generalized bell MF aggre- 
gated by Dombi’s T-norm operator. 
23. Use bellmanu.m as a template to write new MATLAB files that allow the manual
tuning of the following parameterized MFs: (a) triangular MFs; (b) trapezoidal
MFs; (c) Gaussian MFs; (d) sigmoidal MFs; (e) S-MFs (see Exercise 6);
and (f) π-MFs (also see Exercise 8).

24. Use bellanim.m (or siganim.m) as a template to write new MATLAB files that
show the animation of the following parameterized MFs: (a) triangular MFs;
(b) trapezoidal MFs; (c) Gaussian MFs; (d) S-MFs (see Exercise 6); and (e)
π-MFs (see Exercise 8).
Chapter 3


Fuzzy Rules and Fuzzy Reasoning


J.-S. R. Jang 


3.1 INTRODUCTION 

In this chapter we introduce the concepts of the extension principle and fuzzy rela- 
tions, which expand the notions and applicability of fuzzy sets introduced previously. 
Then we present the definition of linguistic variables and linguistic values and ex- 
plain how to use them in fuzzy rules, which are an efficient tool for quantitative 
modeling of words or sentences in a natural or artificial language. By interpreting 
fuzzy rules as appropriate fuzzy relations, we investigate different schemes of fuzzy 
reasoning, where inference procedures based on the concept of the compositional 
rule of inference are used to derive conclusions from a set of fuzzy rules and known 
facts. 

Fuzzy rules and fuzzy reasoning are the backbone of fuzzy inference systems, 
which are the most important modeling tool based on fuzzy set theory. They 
have been successfully applied to a wide range of areas, such as automatic control, 
expert systems, pattern recognition, time series prediction, and data classification. 
In-depth discussion about fuzzy inference systems is provided in Chapter 4. 

3.2 EXTENSION PRINCIPLE AND FUZZY RELATIONS 

We shall start by giving definitions and examples of the extension principle and 
fuzzy relations, which are the rationales behind fuzzy reasoning. 

3.2.1 Extension Principle 

The extension principle [4, 8] is a basic concept of fuzzy set theory that provides 
a general procedure for extending crisp domains of mathematical expressions to 
fuzzy domains. This procedure generalizes a common point-to-point mapping of a
function f(·) to a mapping between fuzzy sets. More specifically, suppose that f is
a function from X to Y and A is a fuzzy set on X defined as

$$A = \mu_A(x_1)/x_1 + \mu_A(x_2)/x_2 + \cdots + \mu_A(x_n)/x_n.$$

Then the extension principle states that the image of fuzzy set A under the mapping
f(·) can be expressed as a fuzzy set B,

$$B = f(A) = \mu_A(x_1)/y_1 + \mu_A(x_2)/y_2 + \cdots + \mu_A(x_n)/y_n,$$

where $y_i = f(x_i)$, i = 1, ..., n. In other words, the fuzzy set B can be defined
through the values of f(·) at $x_1, \ldots, x_n$. If f(·) is a many-to-one mapping, then
there exist $x_1, x_2 \in X$, $x_1 \ne x_2$, such that $f(x_1) = f(x_2) = y^*$, $y^* \in Y$. In this case,
the membership grade of B at $y = y^*$ is the maximum of the membership grades of
A at $x = x_1$ and $x = x_2$, since $f(x) = y^*$ may result from either $x = x_1$ or $x = x_2$.
More generally, we have

$$\mu_B(y) = \max_{x = f^{-1}(y)} \mu_A(x).$$

A simple example follows. 

Example 3.1 Application of the extension principle to fuzzy sets with discrete uni- 
verses 

Let 

$$A = 0.1/{-2} + 0.4/{-1} + 0.8/0 + 0.9/1 + 0.3/2$$

and

$$f(x) = x^2 - 3.$$

Upon applying the extension principle, we have

$$\begin{aligned} B &= 0.1/1 + 0.4/{-2} + 0.8/{-3} + 0.9/{-2} + 0.3/1 \\ &= 0.8/{-3} + (0.4 \vee 0.9)/{-2} + (0.1 \vee 0.3)/1 \\ &= 0.8/{-3} + 0.9/{-2} + 0.3/1, \end{aligned}$$

where ∨ represents max. Figure 3.1 illustrates this example.
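Example 3.1 can be reproduced in a few lines by storing a discrete fuzzy set as a mapping from elements to membership grades. The representation and the function name `extend` are illustrative choices, not part of the text.

```python
def extend(A, f):
    """Image of the discrete fuzzy set A under a crisp function f.

    A maps each element x to its membership grade. When several x map to
    the same y, the grades are combined with max, as the extension
    principle prescribes for many-to-one mappings.
    """
    B = {}
    for x, grade in A.items():
        y = f(x)
        B[y] = max(B.get(y, 0.0), grade)
    return B

A = {-2: 0.1, -1: 0.4, 0: 0.8, 1: 0.9, 2: 0.3}
B = extend(A, lambda x: x ** 2 - 3)
# B holds 0.8 at y = -3, 0.9 at y = -2, and 0.3 at y = 1
```

Elements of Y that are not the image of any x simply do not appear in `B`, which corresponds to a membership grade of 0.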

For a fuzzy set with a continuous universe of discourse X , an analogous proce- 
dure applies. 

Figure 3.1. Extension principle on fuzzy sets with discrete universes.

Example 3.2 Application of the extension principle to fuzzy sets with continuous
universes

Let

$$\mu_A(x) = \text{bell}(x; 1.5, 2, 0.5)$$

and

$$f(x) = \begin{cases} (x - 1)^2 - 1, & \text{if } x \ge 0. \\ x, & \text{if } x \le 0. \end{cases}$$

Figure 3.2(a) is the plot of y = f(x); Figure 3.2(c) is $\mu_A(x)$, the MF of A. After
employing the extension principle, we obtain a fuzzy set B; its MF is shown in
Figure 3.2(b), where the plot of $\mu_B(y)$ is rotated 90 degrees for easy viewing. Since
f(x) is a many-to-one mapping for x ∈ [−1, 2], the max operator is used to obtain
the membership grades of B when y ∈ [−1, 0]. This causes discontinuities of $\mu_B(y)$
at y = 0 and −1. The derivation of $\mu_B(y)$ is left as Exercise 1 at the end of this
chapter.


□ 

Now we consider a more general situation. Suppose that f is a mapping from
an n-dimensional product space $X_1 \times \cdots \times X_n$ to a single universe Y such that
$f(x_1, \ldots, x_n) = y$, and there is a fuzzy set $A_i$ in each $X_i$, i = 1, ..., n. Since
each element in an input vector $(x_1, \ldots, x_n)$ occurs simultaneously, this implies an
AND operation. Therefore, the membership grade of fuzzy set B induced by the
mapping f should be the minimum of the membership grades of the constituent
fuzzy sets $A_i$, i = 1, ..., n. With this understanding, we give a complete formal
definition of the extension principle.

Definition 3.1 Extension principle 

Suppose that function f is a mapping from an n-dimensional Cartesian product
space $X_1 \times X_2 \times \cdots \times X_n$ to a one-dimensional universe Y such that $y = f(x_1, \ldots, x_n)$,
and suppose $A_1, \ldots, A_n$ are n fuzzy sets in $X_1, \ldots, X_n$, respectively. Then the
extension principle asserts that the fuzzy set B induced by the mapping f is defined
by

$$\mu_B(y) = \begin{cases} \max\limits_{(x_1, \ldots, x_n),\ (x_1, \ldots, x_n) = f^{-1}(y)} \left[\min_i \mu_{A_i}(x_i)\right], & \text{if } f^{-1}(y) \ne \emptyset. \\ 0, & \text{if } f^{-1}(y) = \emptyset. \end{cases} \tag{3.1}$$


□ 
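For discrete universes, Definition 3.1 can be evaluated by enumerating every input combination: the strength of a combination is the min of its grades, and combinations that map to the same y are merged with max. The sketch below applies this to the sum of two fuzzy numbers; the sets `A1` and `A2` and the function name are made up for illustration.

```python
from itertools import product

def extend_n(fuzzy_sets, f):
    """Extension principle for a crisp function of several discrete fuzzy sets.

    fuzzy_sets is a list of dicts, each mapping elements to membership grades.
    """
    B = {}
    for combo in product(*(A.items() for A in fuzzy_sets)):
        xs = [x for x, _ in combo]
        strength = min(grade for _, grade in combo)  # AND across the inputs
        y = f(*xs)
        B[y] = max(B.get(y, 0.0), strength)          # OR across preimages of y
    return B

A1 = {1: 0.5, 2: 1.0, 3: 0.5}  # illustrative fuzzy set "about 2"
A2 = {2: 0.4, 3: 1.0, 4: 0.4}  # illustrative fuzzy set "about 3"
B = extend_n([A1, A2], lambda x1, x2: x1 + x2)
# B: grade 0.4 at 3, 0.5 at 4, 1.0 at 5, 0.5 at 6, 0.4 at 7
```

Values of y with an empty preimage never appear in the result, matching the second case of Equation (3.1). The enumeration grows with the product of the set sizes, so this brute-force form is only practical for small discrete universes.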
Figure 3.2. Extension principle on fuzzy sets with continuous universes, as explained
in Example 3.2. The lower plot is fuzzy set A; the upper left is the function
y = f(x); and the upper right is the fuzzy set B induced via the extension principle.
(MATLAB file: extensio.m)


The foregoing extension principle assumes that $y = f(x_1, \ldots, x_n)$ is a crisp function.
In cases where f is a fuzzy function [or, more precisely, when $y = f(x_1, \ldots, x_n)$
is a fuzzy set characterized by an (n + 1)-dimensional MF], we can employ the
compositional rule of inference introduced in Section 3.4.1 to find the induced
fuzzy set B.

3.2.2 Fuzzy Relations 

Binary fuzzy relations [4, 6] are fuzzy sets in X × Y which map each element in
X × Y to a membership grade between 0 and 1. In particular, unary fuzzy relations
are fuzzy sets with one-dimensional MFs; binary fuzzy relations are fuzzy sets with 
two-dimensional MFs, and so on. Applications of fuzzy relations include areas such 
as fuzzy control and decision making. Here we restrict our attention to binary fuzzy 
relations; a generalization to n-ary relations is straightforward. 

Definition 3.2 Binary fuzzy relation 


Let X and Y be two universes of discourse. Then

$$\mathcal{R} = \{((x, y), \mu_{\mathcal{R}}(x, y)) \mid (x, y) \in X \times Y\} \tag{3.2}$$

is a binary fuzzy relation in X × Y. [Note that $\mu_{\mathcal{R}}(x, y)$ is in fact a two-dimensional
MF introduced in Section 2.4.2.]


□ 


Example 3.3 Binary fuzzy relations 

Let X = Y = $R^+$ (the positive real line) and $\mathcal{R}$ = “y is much greater than x.” The
MF of the fuzzy relation $\mathcal{R}$ can be subjectively defined as

$$\mu_{\mathcal{R}}(x, y) = \begin{cases} \dfrac{y - x}{x + y + 2}, & \text{if } y > x. \\ 0, & \text{if } y \le x. \end{cases} \tag{3.3}$$

If X = {3, 4, 5} and Y = {3, 4, 5, 6, 7}, then it is convenient to express the fuzzy
relation $\mathcal{R}$ as a relation matrix:

$$\mathcal{R} = \begin{bmatrix} 0 & 0.111 & 0.200 & 0.273 & 0.333 \\ 0 & 0 & 0.091 & 0.167 & 0.231 \\ 0 & 0 & 0 & 0.077 & 0.143 \end{bmatrix}, \tag{3.4}$$

where the element at row i and column j is equal to the membership grade between
the ith element of X and jth element of Y.


□ 
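The relation matrix of Equation (3.4) follows directly from evaluating Equation (3.3) on every pair (x, y). A minimal sketch, with an illustrative function name and rounding to three decimals to match the printed matrix:

```python
def mu_much_greater(x, y):
    """Membership grade of "y is much greater than x" from Equation (3.3)."""
    return (y - x) / (x + y + 2) if y > x else 0.0

X = [3, 4, 5]
Y = [3, 4, 5, 6, 7]

# Row i corresponds to the ith element of X, column j to the jth element of Y.
R = [[round(mu_much_greater(x, y), 3) for y in Y] for x in X]
# R[0] is [0.0, 0.111, 0.2, 0.273, 0.333]
```

Storing the relation as a list of rows keeps it in the same layout as the printed matrix, which is convenient for the composition operations defined next.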


Other common examples of binary fuzzy relations are as follows: 

• x is close to y (x and y are numbers) 

• x depends on y (x and y are events) 

• x and y look alike (x and y are persons, objects, and so on) 

• If x is large, then y is small (x is an observed reading and y is a corresponding 
action). 

The last expression, “If x is A, then y is B,” is used repeatedly in a fuzzy inference 
system. We will explore fuzzy relations of this kind in the following chapter. 

Fuzzy relations in different product spaces can be combined through a compo- 
sition operation. Different composition operations have been suggested for fuzzy 
relations; the best known is the max-min composition proposed by Zadeh [4]. 

Definition 3.3 Max-min composition 
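Let $\mathcal{R}_1$ and $\mathcal{R}_2$ be two fuzzy relations defined on X × Y and Y × Z, respectively.
The max-min composition of $\mathcal{R}_1$ and $\mathcal{R}_2$ is a fuzzy set defined by

$$\mathcal{R}_1 \circ \mathcal{R}_2 = \{[(x, z), \max_y \min(\mu_{\mathcal{R}_1}(x, y), \mu_{\mathcal{R}_2}(y, z))] \mid x \in X, y \in Y, z \in Z\}, \tag{3.5}$$

or, equivalently,

$$\begin{aligned} \mu_{\mathcal{R}_1 \circ \mathcal{R}_2}(x, z) &= \max_y \min[\mu_{\mathcal{R}_1}(x, y), \mu_{\mathcal{R}_2}(y, z)] \\ &= \vee_y\, [\mu_{\mathcal{R}_1}(x, y) \wedge \mu_{\mathcal{R}_2}(y, z)], \end{aligned} \tag{3.6}$$

with the understanding that ∨ and ∧ represent max and min, respectively.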


□ 

When $\mathcal{R}_1$ and $\mathcal{R}_2$ are expressed as relation matrices, the calculation of $\mathcal{R}_1 \circ \mathcal{R}_2$
is almost the same as matrix multiplication, except that × and + are replaced by
∧ and ∨, respectively. For this reason, the max-min composition is also called the
max-min product.

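Several properties common to binary relations and max-min composition are
given next, where $\mathcal{R}$, $\mathcal{S}$, and $\mathcal{T}$ are binary relations on X × Y, Y × Z, and Z × W,
respectively.

$$\begin{array}{ll} \text{Associativity:} & \mathcal{R} \circ (\mathcal{S} \circ \mathcal{T}) = (\mathcal{R} \circ \mathcal{S}) \circ \mathcal{T} \\ \text{Distributivity over union:} & \mathcal{R} \circ (\mathcal{S} \cup \mathcal{T}) = (\mathcal{R} \circ \mathcal{S}) \cup (\mathcal{R} \circ \mathcal{T}) \\ \text{Weak distributivity over intersection:} & \mathcal{R} \circ (\mathcal{S} \cap \mathcal{T}) \subseteq (\mathcal{R} \circ \mathcal{S}) \cap (\mathcal{R} \circ \mathcal{T}) \\ \text{Monotonicity:} & \mathcal{S} \subseteq \mathcal{T} \Longrightarrow \mathcal{R} \circ \mathcal{S} \subseteq \mathcal{R} \circ \mathcal{T} \end{array} \tag{3.7}$$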

Although max-min composition is widely used, it is not easily subjected to 
mathematical analysis. To achieve greater mathematical tractability, max-product 
composition has been proposed as an alternative to max-min composition. 

Definition 3.4 Max-product composition 

Assuming the same notation as used in the definition of max-min composition, we 
can define max-product composition as follows: 

$$\mu_{\mathcal{R}_1 \circ \mathcal{R}_2}(x, z) = \max_y\, [\mu_{\mathcal{R}_1}(x, y)\, \mu_{\mathcal{R}_2}(y, z)]. \tag{3.8}$$

□ 

The following example demonstrates how to apply max-min and max-product
composition and how to interpret the resulting fuzzy relation $\mathcal{R}_1 \circ \mathcal{R}_2$.

Example 3.4 Max-min and max-product composition 
Let

$\mathcal{R}_1$ = “x is relevant to y”

$\mathcal{R}_2$ = “y is relevant to z”

be two fuzzy relations defined on $X \times Y$ and $Y \times Z$, respectively, where $X = \{1, 2, 3\}$,
$Y = \{\alpha, \beta, \gamma, \delta\}$, and $Z = \{a, b\}$. Assume that $\mathcal{R}_1$ and $\mathcal{R}_2$ can be expressed as the
following relation matrices:

\[ \mathcal{R}_1 = \begin{bmatrix} 0.1 & 0.3 & 0.5 & 0.7 \\ 0.4 & 0.2 & 0.8 & 0.9 \\ 0.6 & 0.8 & 0.3 & 0.2 \end{bmatrix}, \qquad
\mathcal{R}_2 = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.3 \\ 0.5 & 0.6 \\ 0.7 & 0.2 \end{bmatrix}. \]


Now we want to find $\mathcal{R}_1 \circ \mathcal{R}_2$, which can be interpreted as a derived fuzzy relation
“x is relevant to z” based on $\mathcal{R}_1$ and $\mathcal{R}_2$. For simplicity, suppose that we are only
interested in the degree of relevance between $2$ ($\in X$) and $a$ ($\in Z$). If we adopt
max-min composition, then

\[ \begin{aligned}
\mu_{\mathcal{R}_1 \circ \mathcal{R}_2}(2, a) &= \max(0.4 \wedge 0.9,\ 0.2 \wedge 0.2,\ 0.8 \wedge 0.5,\ 0.9 \wedge 0.7) \\
&= \max(0.4, 0.2, 0.5, 0.7) \\
&= 0.7 \quad \text{(by max-min composition).}
\end{aligned} \]

On the other hand, if we choose max-product composition instead, we have

\[ \begin{aligned}
\mu_{\mathcal{R}_1 \circ \mathcal{R}_2}(2, a) &= \max(0.4 \times 0.9,\ 0.2 \times 0.2,\ 0.8 \times 0.5,\ 0.9 \times 0.7) \\
&= \max(0.36, 0.04, 0.40, 0.63) \\
&= 0.63 \quad \text{(by max-product composition).}
\end{aligned} \]

Figure 3.3 illustrates the composition of two fuzzy relations, where the relation 
between element 2 in X and element a in Z is built up via the four possible paths 
(solid lines) connecting these two elements. The degree of relevance between 2 and 
a is the maximum of these four paths’ strengths, while each path’s strength is the 
minimum (or product) of the strengths of its constituent links. 


□ 

In the previous example, we used max to interpret OR and * to interpret AND,
where * can be either min or product. The MATLAB file max_star.m (available via
FTP or WWW, see page xxiii) computes the max-* composition of two relation
matrices, which covers both the max-min and the max-product case. In general, we
can have (T-conorm)-(T-norm) composition that interprets OR and AND using
T-conorm and T-norm operators, respectively. These extended meanings for fuzzy
OR and AND are discussed in the next section.
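The max-* composition described above can be sketched in a few lines of Python. This is not a port of max_star.m, whose source is not reproduced here; the function name `max_star` and its `star` parameter are assumptions made for this illustration. The matrices are those of Example 3.4.

```python
def max_star(r1, r2, star=min):
    """Max-* composition of r1 (m x n) with r2 (n x p).

    `star` interprets AND (for example min or product); max interprets OR.
    """
    n = len(r2)
    if any(len(row) != n for row in r1):
        raise ValueError("inner dimensions of the relation matrices must agree")
    return [[max(star(r1[i][k], r2[k][j]) for k in range(n))
             for j in range(len(r2[0]))]
            for i in range(len(r1))]


def product(a, b):
    """Algebraic product, used as the * operator for max-product composition."""
    return a * b


# Relation matrices of Example 3.4: rows of R1 are x = 1, 2, 3;
# columns of R2 are z = a, b.
R1 = [[0.1, 0.3, 0.5, 0.7],
      [0.4, 0.2, 0.8, 0.9],
      [0.6, 0.8, 0.3, 0.2]]
R2 = [[0.9, 0.1],
      [0.2, 0.3],
      [0.5, 0.6],
      [0.7, 0.2]]

max_min_result = max_star(R1, R2, min)
max_prod_result = max_star(R1, R2, product)

# Entry [1][0] is the degree of relevance between 2 (in X) and a (in Z).
print(max_min_result[1][0], round(max_prod_result[1][0], 2))
```

Passing the AND operator as a function keeps the sketch open to other T-norms without changing the loop structure.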
Figure 3.3. Composition of fuzzy relations over X, Y, and Z.



3.3 FUZZY IF-THEN RULES 

In this section, the definition and examples of linguistic variables are given first. 
Then we explain two interpretations of fuzzy if-then rules and how to obtain a 
fuzzy relation that represents the meaning of a given fuzzy rule. 

3.3.1 Linguistic Variables 

As was pointed out by Zadeh [7], conventional techniques for system analysis are
intrinsically unsuited for dealing with humanistic systems, whose behavior is strongly
influenced by human judgment, perception, and emotions. This is a manifestation 
of what might be called the principle of incompatibility: “As the complexity of 
a system increases, our ability to make precise and yet significant statements about 
its behavior diminishes until a threshold is reached beyond which precision and sig- 
nificance become almost mutually exclusive characteristics” [7]. It was because of 
this belief that Zadeh proposed the concept of linguistic variables [5] as an alterna- 
tive approach to modeling human thinking — an approach that, in an approximate 
manner, serves to summarize information and express it in terms of fuzzy sets in- 
stead of crisp numbers. We present the formal definition of linguistic variables next; 
the example that follows will clarify the definition, which may seem cryptic at the 
first reading. 

Definition 3.5 Linguistic variables and other related terminology 

A linguistic variable is characterized by a quintuple (x,T(x),X,G,M) in which 
x is the name of the variable; T(x) is the term set of x — that is, the set of its 
linguistic values or linguistic terms; X is the universe of discourse; G is a 
syntactic rule which generates the terms in T(x); and M is a semantic rule which 
associates with each linguistic value A its meaning M(A), where M(A) denotes a 
fuzzy set in X. 
Figure 3.4. Typical membership functions of the term set T(age). (MATLAB file:
lv.m)


□ 


The following example helps clarify the preceding definition. 

Example 3.5 Linguistic variables and linguistic values 

If age is interpreted as a linguistic variable, then its term set T(age) could be

T(age) = { young, not young, very young, not very young, ...,
           middle aged, not middle aged, ...,
           old, not old, very old, more or less old, not very old, ...,
           not very young and not very old, ... },                    (3.9)

where each term in T(age) is characterized by a fuzzy set of a universe of discourse
X = [0,100], as shown in Figure 3.4. Usually we use “age is young” to denote 
the assignment of the linguistic value “young” to the linguistic variable age. By 
contrast, when age is interpreted as a numerical variable, we use the expression 
“age = 20” instead to assign the numerical value “20” to the numerical variable 
age. The syntactic rule refers to the way the linguistic values in the term set 
T(age) are generated. The semantic rule defines the membership function of each 
linguistic value of the term set; Figure 3.4 displays some of the typical membership 
functions. 


□ 

From the preceding example, we can see that the term set consists of several
primary terms (young, middle aged, old) modified by the negation (“not”) and/or
the hedges (very, more or less, quite, extremely, and so forth), and then linked by
connectives such as and, or, either, and neither. In the sequel, we shall treat the
connectives, the hedges, and the negation as operators that change the meaning of
their operands in a specified, context-independent fashion.

Definition 3.6 Concentration and dilation of linguistic values 
Let A be a linguistic value characterized by a fuzzy set with membership function
$\mu_A(\cdot)$. Then $A^k$ is interpreted as a modified version of the original linguistic value
expressed as

\[ A^k = \int_X [\mu_A(x)]^k / x. \tag{3.10} \]

In particular, the operation of concentration is defined as

\[ \mathrm{CON}(A) = A^2, \tag{3.11} \]

while that of dilation is expressed by

\[ \mathrm{DIL}(A) = A^{0.5}. \tag{3.12} \]

□ 

Conventionally, we take CON(A) and DIL(A) to be the results of applying the 
hedges very and more or less, respectively, to the linguistic term A. However, other 
consistent definitions for these linguistic hedges are possible and well justified for 
various applications. 

Following the definitions in the previous chapter, we can interpret the negation 
operator NOT and the connectives AND and OR as 

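\[ \begin{aligned}
\mathrm{NOT}(A) &= \neg A = \int_X [1 - \mu_A(x)] / x, \\
A \ \mathrm{AND}\ B &= A \cap B = \int_X [\mu_A(x) \wedge \mu_B(x)] / x, \\
A \ \mathrm{OR}\ B &= A \cup B = \int_X [\mu_A(x) \vee \mu_B(x)] / x,
\end{aligned} \tag{3.13} \]

respectively, where A and B are two linguistic values whose meanings are defined
by $\mu_A(\cdot)$ and $\mu_B(\cdot)$.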

Through the use of CON(-) and DIL(-) for the linguistic hedges very and more 
or less, together with the interpretations of negation and the connectives AND and 
OR in Equation (3.13), we are now able to construct the meaning of a composite 
linguistic term, such as “not very young and not very old” and “young but not 
too young.” 

Example 3.6 Constructing MFs for composite linguistic terms 

Let the meanings of the linguistic terms young and old be defined by the following 
membership functions: 

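\[ \mu_{\text{young}}(x) = \text{bell}(x, 20, 2, 0) = \frac{1}{1 + \left(\frac{x}{20}\right)^4}, \tag{3.14} \]

\[ \mu_{\text{old}}(x) = \text{bell}(x, 30, 3, 100) = \frac{1}{1 + \left(\frac{x - 100}{30}\right)^6}, \tag{3.15} \]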

where x is the age of a given person, with the interval [0, 100] as the universe of 
discourse. Then we can construct MFs for the following composite linguistic terms: 
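• more or less old = DIL(old) = old$^{0.5}$

\[ = \int_X \sqrt{\frac{1}{1 + \left(\frac{x - 100}{30}\right)^6}} \Big/ x. \]

• not young and not old = $\neg$young $\cap$ $\neg$old

\[ = \int_X \left[1 - \frac{1}{1 + \left(\frac{x}{20}\right)^4}\right] \wedge \left[1 - \frac{1}{1 + \left(\frac{x - 100}{30}\right)^6}\right] \Big/ x. \]

• young but not too young = young $\cap$ $\neg$young$^2$

\[ = \int_X \left[\frac{1}{1 + \left(\frac{x}{20}\right)^4}\right] \wedge \left[1 - \left(\frac{1}{1 + \left(\frac{x}{20}\right)^4}\right)^2\right] \Big/ x. \]

• extremely old = CON(CON(CON(old))) = ((old$^2$)$^2$)$^2$

\[ = \int_X \left[\frac{1}{1 + \left(\frac{x - 100}{30}\right)^6}\right]^8 \Big/ x. \]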


We assume that the meaning of the hedge too is the same as that of very and the
meaning of extremely is the same as that of very very very. Figure 3.5(a) shows the
MFs for the primary linguistic terms young and old; Figure 3.5(b) shows the MFs
for the composite linguistic terms more or less old, not young and not old, young
but not too young, and extremely old.


□ 
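The composite terms of Example 3.6 can be evaluated pointwise with a short Python sketch. It assumes the generalized bell MF has the form bell(x; a, b, c) = 1 / (1 + |(x - c)/a|^(2b)), interprets AND as min and NOT as complement, and uses helper names (`con`, `dil`, `f_not`, `f_and`) that are made up for this illustration.

```python
def bell(x, a, b, c):
    """Generalized bell MF: 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def young(x):
    return bell(x, 20, 2, 0)


def old(x):
    return bell(x, 30, 3, 100)


def con(mf):
    """Concentration, used for the hedge 'very' (and 'too' in Example 3.6)."""
    return lambda x: mf(x) ** 2


def dil(mf):
    """Dilation, used for the hedge 'more or less'."""
    return lambda x: mf(x) ** 0.5


def f_not(mf):
    """Negation as the complement 1 - mu."""
    return lambda x: 1.0 - mf(x)


def f_and(mf1, mf2):
    """Conjunction interpreted as min."""
    return lambda x: min(mf1(x), mf2(x))


more_or_less_old = dil(old)
not_young_and_not_old = f_and(f_not(young), f_not(old))
young_but_not_too_young = f_and(young, f_not(con(young)))
extremely_old = con(con(con(old)))

for age in (10, 20, 40, 70, 100):
    print(age,
          round(more_or_less_old(age), 3),
          round(not_young_and_not_old(age), 3),
          round(young_but_not_too_young(age), 3),
          round(extremely_old(age), 3))
```

Building each hedge as a function that returns a new MF mirrors the way the operators are chained in the text, for example CON(CON(CON(old))).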

Another operation that reduces the fuzziness of a fuzzy set A is defined as 
follows. 


Definition 3.7 Contrast intensification 


The operation of contrast intensification on a linguistic value A is defined by 


\[ \mathrm{INT}(A) = \begin{cases} 2A^2, & \text{for } 0 \le \mu_A(x) \le 0.5, \\ \neg 2(\neg A)^2, & \text{for } 0.5 \le \mu_A(x) \le 1. \end{cases} \tag{3.16} \]

□ 


The contrast intensifier INT increases the values of $\mu_A(x)$ which are above 0.5
and diminishes those which are below this point. Thus, contrast intensification has
the effect of reducing the fuzziness of the linguistic value A. The inverse operator of
the contrast intensifier is the contrast diminisher DIM, which is explored in Exercise 9.

Example 3.7 Contrast intensifier. 

Let A be defined by

\[ \mu_A(x) = \text{triangle}(x, 1, 3, 9), \]

which is a triangular MF with the vertex at x = 3 and the base located at x = 1
to x = 9. Figure 3.6 illustrates the results of applying the contrast intensifier INT
to A several times: the solid line is A; the dotted line is INT(A); the dashed line is
INT$^2$(A) = INT(INT(A)); and the dash-dot line is INT$^3$(A) = INT(INT(INT(A))).
Thus repeated applications of INT reduce the fuzziness of a fuzzy set; in the
extreme case, the fuzzy set becomes a crisp set with boundaries at the crossover
points.

Figure 3.5. MFs for primary and composite linguistic values in Example 3.6.
Panel (a): primary linguistic values. (MATLAB file: comply.m)

□ 
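The effect of repeated contrast intensification can be sketched in Python. The sketch assumes the usual pointwise form of INT, namely 2μ² for grades up to 0.5 and 1 − 2(1 − μ)² for grades above 0.5, and a triangular MF with feet at 1 and 9 and peak at 3 as in Example 3.7; the function names are made up for this illustration.

```python
def triangle(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def intensify(mu):
    """Contrast intensification applied to a single membership grade."""
    if mu <= 0.5:
        return 2.0 * mu ** 2
    return 1.0 - 2.0 * (1.0 - mu) ** 2


def repeat_int(mu, times):
    """Apply the contrast intensifier `times` times in a row."""
    for _ in range(times):
        mu = intensify(mu)
    return mu


# Grades of A, INT(A), INT^2(A), and INT^3(A) at a few sample points.
for x in (1.5, 2.0, 2.5, 3.0, 5.0, 6.0, 7.0):
    mu = triangle(x, 1, 3, 9)
    print(x, [round(repeat_int(mu, k), 3) for k in range(4)])
```

Grades of exactly 0.5 (the crossover points, here x = 2 and x = 6) are fixed points of INT, which is why the limiting crisp set has its boundaries there.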

When we define MFs of linguistic values in a term set, it is intuitively reason- 
able to have these MFs roughly satisfy the requirement of orthogonality, which is 
described next. 

Definition 3.8 Orthogonality 

A term set $T = \{t_1, \ldots, t_n\}$ of a linguistic variable x on the universe X is orthogonal
if it fulfills the following property:

\[ \sum_{i=1}^{n} \mu_{t_i}(x) = 1, \quad \forall x \in X, \tag{3.17} \]

where the $t_i$'s are convex and normal fuzzy sets defined on X and these fuzzy sets
make up the term set T.
Figure 3.6. Example 3.7: the effects of the contrast intensifier. (MATLAB file:
intensif.m)


□ 

For the MFs in a term set to be intuitively reasonable, the orthogonality re- 
quirement has to be followed to some extent. This is shown in Figure 2.2, where 
the term set contains three linguistic terms, “young,” “middle aged,” and “old.” 

An in-depth exposition of linguistic variables and their applications can be found 
in [8]. Next, we shall discuss the use of linguistic variables and linguistic values in 
fuzzy if-then rules. 

3.3.2 Fuzzy If-Then Rules 

A fuzzy if-then rule (also known as fuzzy rule, fuzzy implication, or fuzzy 
conditional statement) assumes the form 

if x is A then y is B,    (3.18)

where A and B are linguistic values defined by fuzzy sets on universes of discourse
X and Y, respectively. Often “x is A” is called the antecedent or premise, while
“y is B” is called the consequence or conclusion. Examples of fuzzy if-then rules
are widespread in our daily linguistic expressions, such as the following:

• If pressure is high, then volume is small. 

• If the road is slippery, then driving is dangerous. 

• If a tomato is red, then it is ripe. 

• If the speed is high, then apply the brake a little. 

Before we can employ fuzzy if-then rules to model and analyze a system, first
we have to formalize what is meant by the expression “if x is A then y is B,” which
is sometimes abbreviated as $A \to B$. In essence, the expression describes a relation
between two variables x and y; this suggests that a fuzzy if-then rule be defined as a
binary fuzzy relation R on the product space $X \times Y$. Generally speaking, there are
two ways to interpret the fuzzy rule $A \to B$. If we interpret $A \to B$ as A coupled
with B, then

\[ R = A \to B = A \times B = \int_{X \times Y} \mu_A(x) * \mu_B(y) / (x, y), \]

where $*$ is a T-norm operator and $A \to B$ is used again to represent the fuzzy
relation R. On the other hand, if $A \to B$ is interpreted as A entails B, then it can
be written as four different formulas:

• Material implication:

\[ R = A \to B = \neg A \cup B. \tag{3.19} \]

• Propositional calculus:

\[ R = A \to B = \neg A \cup (A \cap B). \tag{3.20} \]

• Extended propositional calculus:

\[ R = A \to B = (\neg A \cap \neg B) \cup B. \tag{3.21} \]

• Generalization of modus ponens:

\[ \mu_R(x, y) = \sup\{c \mid \mu_A(x) * c \le \mu_B(y) \text{ and } 0 \le c \le 1\}, \tag{3.22} \]

where $R = A \to B$ and $*$ is a T-norm operator.

Although these four formulas are different in appearance, they all reduce to the
familiar identity $A \to B = \neg A \cup B$ when A and B are propositions in the sense
of two-valued logic. Figure 3.7 illustrates these two interpretations of a fuzzy rule
$A \to B$.

Based on these two interpretations and various T-norm and T-conorm operators,
a number of qualified methods can be formulated to calculate the fuzzy relation
$R = A \to B$. Note that R can be viewed as a fuzzy set with a two-dimensional MF

\[ \mu_R(x, y) = f(\mu_A(x), \mu_B(y)) = f(a, b), \]

with $a = \mu_A(x)$, $b = \mu_B(y)$, where the function f, called the fuzzy implication
function, performs the task of transforming the membership grades of x in A and
y in B into those of (x, y) in $A \to B$.

Suppose that we adopt the first interpretation, “A coupled with B,” as the meaning
of $A \to B$. Then four different fuzzy relations $A \to B$ result from employing
four of the most commonly used T-norm operators.
Figure 3.7. Two interpretations of fuzzy implication: (a) A coupled with B; (b) A
entails B.

• $R_m = A \times B = \int_{X \times Y} \mu_A(x) \wedge \mu_B(y) / (x, y)$, or $f_c(a, b) = a \wedge b$. This relation,
which was proposed by Mamdani [3], results from using the min operator for
conjunction.

• $R_p = A \times B = \int_{X \times Y} \mu_A(x)\mu_B(y) / (x, y)$, or $f_p(a, b) = ab$. Proposed by
Larsen [2], this relation is based on using the algebraic product operator for
conjunction.

• $R_{bp} = A \times B = \int_{X \times Y} \mu_A(x) \odot \mu_B(y) / (x, y) = \int_{X \times Y} 0 \vee (\mu_A(x) + \mu_B(y) - 1) / (x, y)$,
or $f_{bp}(a, b) = 0 \vee (a + b - 1)$. This formula employs the bounded
product operator for conjunction.

• $R_{dp} = A \times B = \int_{X \times Y} \mu_A(x) \,\hat{\cdot}\, \mu_B(y) / (x, y)$, or

\[ f(a, b) = a \,\hat{\cdot}\, b = \begin{cases} a & \text{if } b = 1, \\ b & \text{if } a = 1, \\ 0 & \text{otherwise.} \end{cases} \]

This formula uses the drastic product operator for conjunction.

The first row of Figure 3.8 shows these four fuzzy implication functions [with
$a = \mu_A(x)$ and $b = \mu_B(y)$]; the second row shows the corresponding fuzzy relations
$R_m$, $R_p$, $R_{bp}$, and $R_{dp}$ when $\mu_A(x) = \text{bell}(x; 4, 3, 10)$ and $\mu_B(y) = \text{bell}(y; 4, 3, 10)$.

When we adopt the second interpretation, “A entails B,” as the meaning of
$A \to B$, again there are a number of fuzzy implication functions that are reasonable
candidates. The following four have been proposed in the literature:

• $R_a = \neg A \cup B = \int_{X \times Y} 1 \wedge (1 - \mu_A(x) + \mu_B(y)) / (x, y)$, or $f_a(a, b) = 1 \wedge (1 - a + b)$.
This is Zadeh’s arithmetic rule, which follows Equation (3.19) by using the
bounded sum operator for $\cup$.

• $R_{mm} = \neg A \cup (A \cap B) = \int_{X \times Y} (1 - \mu_A(x)) \vee (\mu_A(x) \wedge \mu_B(y)) / (x, y)$, or $f_m(a, b) =
(1 - a) \vee (a \wedge b)$. This is Zadeh’s max-min rule, which follows Equation (3.20)
by using min for $\cap$ and max for $\cup$.
Figure 3.8. First row: fuzzy implication functions based on the interpretation “A
coupled with B”; second row: the corresponding fuzzy relations. Panels: (a) min;
(b) algebraic product; (c) bounded product; (d) drastic product. (MATLAB file:
fuzimp.m)


• $R_s = \neg A \cup B = \int_{X \times Y} (1 - \mu_A(x)) \vee \mu_B(y) / (x, y)$, or $f_s(a, b) = (1 - a) \vee b$. This is
Boolean fuzzy implication using max for $\cup$.

• $R_\Delta = \int_{X \times Y} (\mu_A(x) \,\tilde{<}\, \mu_B(y)) / (x, y)$, where

\[ a \,\tilde{<}\, b = \begin{cases} 1 & \text{if } a \le b, \\ b/a & \text{if } a > b. \end{cases} \]

This is Goguen’s fuzzy implication, which follows Equation (3.22) by using
the algebraic product for the T-norm operator.

Figure 3.9 shows these four fuzzy implication functions [with $a = \mu_A(x)$ and
$b = \mu_B(y)$] and the resulting fuzzy relations $R_a$, $R_{mm}$, $R_s$, and $R_\Delta$ when $\mu_A(x) =
\text{bell}(x; 4, 3, 10)$ and $\mu_B(y) = \text{bell}(y; 4, 3, 10)$.

It should be kept in mind that the fuzzy implication functions introduced here 
are by no means exhaustive. Interested readers can find other feasible fuzzy impli- 
cation functions in [1]. 
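The eight fuzzy implication functions f(a, b) discussed in this section are simple enough to write down directly. The Python sketch below is an illustration only; the function names are assumptions made for this example. The final loop checks the statement that, for two-valued inputs, the “A entails B” functions reduce to classical implication, while the “A coupled with B” functions reduce to classical conjunction.

```python
# "A coupled with B": T-norm based functions.
def f_min(a, b):          # Mamdani (min)
    return min(a, b)

def f_prod(a, b):         # Larsen (algebraic product)
    return a * b

def f_bounded(a, b):      # bounded product
    return max(0.0, a + b - 1.0)

def f_drastic(a, b):      # drastic product
    if b == 1:
        return a
    if a == 1:
        return b
    return 0.0

# "A entails B": implication-style functions.
def f_arith(a, b):        # Zadeh's arithmetic rule
    return min(1.0, 1.0 - a + b)

def f_maxmin(a, b):       # Zadeh's max-min rule
    return max(1.0 - a, min(a, b))

def f_boolean(a, b):      # Boolean fuzzy implication
    return max(1.0 - a, b)

def f_goguen(a, b):       # Goguen's fuzzy implication
    return 1.0 if a <= b else b / a

coupled = [f_min, f_prod, f_bounded, f_drastic]
entails = [f_arith, f_maxmin, f_boolean, f_goguen]

# Two-valued check: a and b restricted to {0, 1}.
crisp_ok = True
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        crisp_ok = crisp_ok and all(f(a, b) == min(a, b) for f in coupled)
        crisp_ok = crisp_ok and all(f(a, b) == max(1.0 - a, b) for f in entails)
print(crisp_ok)
```

On intermediate grades the functions differ; for instance, with a = 0.8 and b = 0.4 the Goguen function gives b/a = 0.5, whereas the Boolean function gives max(0.2, 0.4) = 0.4.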

3.4 FUZZY REASONING 

Fuzzy reasoning, also known as approximate reasoning, is an inference procedure 
that derives conclusions from a set of fuzzy if-then rules and known facts. Before 
introducing fuzzy reasoning, we shall discuss the compositional rule of inference, 
which plays a key role in fuzzy reasoning. 
Figure 3.9. First row: fuzzy implication functions based on the interpretation
“A entails B”; second row: the corresponding fuzzy relations. Panels: (a) Zadeh’s
arithmetic rule; (b) Zadeh’s max-min rule; (c) Boolean fuzzy implication;
(d) Goguen’s fuzzy implication. (MATLAB file: fuzimp.m)


3.4.1 Compositional Rule of Inference 

The concept behind the compositional rule of inference proposed by Zadeh [7] should 
not be totally new to the reader; we employed the same idea to explain the max-min 
composition of relation matrices in Section 3.2.2. Moreover, the extension principle 
in Section 3.2.1 is actually a special case of the compositional rule of inference. 

The compositional rule of inference is a generalization of the following familiar
notion. Suppose that we have a curve y = f(x) that regulates the relation between x
and y. When we are given x = a, then from y = f(x) we can infer that y = b = f(a);
see Figure 3.10(a). A generalization of the aforementioned process would allow a to
be an interval and f(x) to be an interval-valued function, as shown in Figure 3.10(b).
To find the resulting interval y = b corresponding to the interval x = a, we first
construct a cylindrical extension of a and then find its intersection I with the
interval-valued curve. The projection of I onto the y-axis yields the interval y = b.

Going one step further in our generalization, we assume that F is a fuzzy relation
on $X \times Y$ and A is a fuzzy set of X, as shown in Figures 3.11(a) and 3.11(b). To
find the resulting fuzzy set B, again we construct a cylindrical extension c(A) with
base A. The intersection of c(A) and F [Figure 3.11(c)] forms the analog of the
region of intersection I in Figure 3.10(b). By projecting $c(A) \cap F$ onto the y-axis,
we infer y as a fuzzy set B on the y-axis, as shown in Figure 3.11(d).

Specifically, let $\mu_A$, $\mu_{c(A)}$, $\mu_B$, and $\mu_F$ be the MFs of A, c(A), B, and F,
respectively, where $\mu_{c(A)}$ is related to $\mu_A$ through

\[ \mu_{c(A)}(x, y) = \mu_A(x). \]
Figure 3.10. Derivation of y = b from x = a and y = f(x): (a) a and b are
points, y = f(x) is a curve; (b) a and b are intervals, y = f(x) is an interval-valued
function.


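Then

\[ \begin{aligned}
\mu_{c(A) \cap F}(x, y) &= \min[\mu_{c(A)}(x, y), \mu_F(x, y)] \\
&= \min[\mu_A(x), \mu_F(x, y)].
\end{aligned} \]

By projecting $c(A) \cap F$ onto the y-axis, we have

\[ \begin{aligned}
\mu_B(y) &= \max_x \min[\mu_A(x), \mu_F(x, y)] \\
&= \vee_x [\mu_A(x) \wedge \mu_F(x, y)].
\end{aligned} \]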


This formula reduces to the max-min composition (see Definition 3.3 in Section 3.2) 
of two relation matrices if both A (a unary fuzzy relation) and F (a binary fuzzy 
relation) have finite universes of discourse. Conventionally, B is represented as 


\[ B = A \circ F, \]

where $\circ$ denotes the composition operator.

It is interesting to note that the extension principle introduced in Section 3.2.1
is in fact a special case of the compositional rule of inference.
Specifically, if y = f(x) in Figure 3.10 is a common crisp one-to-one or many-to-one
function, then the derivation of the induced fuzzy set B on Y is exactly what is
accomplished by the extension principle.
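For finite universes the compositional rule of inference is a single max-min sweep over a relation matrix. The Python sketch below is illustrative only: the function names (`compose`, `relation_of`) and all membership grades are assumptions made for this example. The second part encodes a crisp many-to-one function y = x² as a 0/1 relation, so that the composition reproduces what the extension principle would give.

```python
def compose(a, f):
    """B = A o F: mu_B(y_j) = max over i of min(mu_A(x_i), mu_F(x_i, y_j))."""
    return [max(min(a[i], f[i][j]) for i in range(len(a)))
            for j in range(len(f[0]))]


def relation_of(func, xs, ys):
    """Crisp relation matrix of y = func(x): 1 where func(x) == y, else 0."""
    return [[1.0 if func(x) == y else 0.0 for y in ys] for x in xs]


# A fuzzy set A on X = {x1, x2, x3} and a fuzzy relation F on X x Y, Y = {y1, y2}.
A = [0.2, 1.0, 0.6]
F = [[0.9, 0.1],
     [0.4, 0.7],
     [0.8, 0.3]]
B = compose(A, F)
print(B)

# Special case: F is the crisp relation of y = x^2 on X = {-1, 0, 1}, Y = {0, 1}.
xs, ys = [-1, 0, 1], [0, 1]
A2 = [0.3, 1.0, 0.8]
B2 = compose(A2, relation_of(lambda x: x * x, xs, ys))
print(B2)
```

In the second case the grade of y = 1 is the larger of the grades of x = −1 and x = 1, which is the max over the preimage prescribed by the extension principle.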

Using the compositional rule of inference, we can formalize an inference proce- 
dure upon a set of fuzzy if-then rules. This inference procedure, generally called 
approximate reasoning or fuzzy reasoning, is the topic of the next subsection. 


3.4.2 Fuzzy Reasoning 

The basic rule of inference in traditional two-valued logic is modus ponens,
according to which we can infer the truth of a proposition B from the truth of A and
the implication $A \to B$. For instance, if A is identified with “the tomato is red”
and B with “the tomato is ripe,” then if it is true that “the tomato is red,” it is
also true that “the tomato is ripe.” This concept is illustrated as follows:



premise 1 (fact): x is A,

premise 2 (rule): if x is A then y is B,

consequence (conclusion): y is B.

Figure 3.11. Compositional rule of inference. (MATLAB file: cri.m)

However, in much of human reasoning, modus ponens is employed in an
approximate manner. For example, if we have the same implication rule “if the tomato is
red, then it is ripe” and we know that “the tomato is more or less red,” then we
may infer that “the tomato is more or less ripe.” This is written as

premise 1 (fact): x is A',

premise 2 (rule): if x is A then y is B,

consequence (conclusion): y is B',

where A' is close to A and B' is close to B. When A, B, A', and B' are fuzzy sets
of appropriate universes, the foregoing inference procedure is called approximate
reasoning or fuzzy reasoning; it is also called generalized modus ponens
(GMP for short), since it has modus ponens as a special case.

Using the compositional rule of inference introduced in the previous subsection, we
can formulate the inference procedure of fuzzy reasoning as the following definition.

Definition 3.9 Approximate reasoning ( fuzzy reasoning) 

Let A, A', and B be fuzzy sets of X, X, and Y, respectively. Assume that the fuzzy
implication $A \to B$ is expressed as a fuzzy relation R on $X \times Y$. Then the fuzzy
set B' induced by “x is A'” and the fuzzy rule “if x is A then y is B” is defined by

\[ \begin{aligned}
\mu_{B'}(y) &= \max_x \min[\mu_{A'}(x), \mu_R(x, y)] \\
&= \vee_x [\mu_{A'}(x) \wedge \mu_R(x, y)],
\end{aligned} \tag{3.23} \]

or, equivalently,

\[ B' = A' \circ R = A' \circ (A \to B). \tag{3.24} \]

□ 

Now we can use the inference procedure of fuzzy reasoning to derive conclusions,
provided that the fuzzy implication $A \to B$ is defined as an appropriate binary fuzzy
relation.

In what follows, we shall discuss the computational aspects of the fuzzy reasoning
introduced in the preceding definition, and then extend the discussion to situations
in which multiple fuzzy rules with multiple antecedents are involved in describing a
system’s behavior. However, we will restrict our considerations to Mamdani’s fuzzy
implication functions and the classical max-min composition, because of their wide
applicability and easy graphic interpretation.


Single Rule with Single Antecedent 

This is the simplest case, and the formula is available in Equation (3.23). A further
simplification of the equation yields

\[ \begin{aligned}
\mu_{B'}(y) &= [\vee_x (\mu_{A'}(x) \wedge \mu_A(x))] \wedge \mu_B(y) \\
&= w \wedge \mu_B(y).
\end{aligned} \]

In other words, first we find the degree of match w as the maximum of $\mu_{A'}(x) \wedge \mu_A(x)$
(the shaded area in the antecedent part of Figure 3.12); then the MF of the resulting
B' is equal to the MF of B clipped by w, shown as the shaded area in the consequent
part of Figure 3.12. Intuitively, w represents a measure of degree of belief for the
antecedent part of a rule; this measure gets propagated by the if-then rules and the
resulting degree of belief or MF for the consequent part (B' in Figure 3.12) should
be no greater than w.
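The clipping interpretation above can be checked against the full composition B' = A' ∘ R on finite universes. The Python sketch below assumes Mamdani's min implication and max-min composition, as in the text; the function names and the membership grades are hypothetical values chosen for this illustration.

```python
def gmp_single(a_prime, a, b):
    """Return (w, B') for one rule 'if x is A then y is B'.

    w is the degree of match max_x min(mu_A'(x), mu_A(x)); B' is B clipped at w.
    """
    w = max(min(ap, av) for ap, av in zip(a_prime, a))
    return w, [min(w, bv) for bv in b]


def gmp_by_composition(a_prime, a, b):
    """Same B', computed as A' o R with R(x, y) = min(mu_A(x), mu_B(y))."""
    r = [[min(av, bv) for bv in b] for av in a]
    return [max(min(a_prime[i], r[i][j]) for i in range(len(a)))
            for j in range(len(b))]


# Hypothetical discretized fuzzy sets.
A = [0.0, 0.5, 1.0, 0.5]        # antecedent MF on X
A_prime = [0.3, 1.0, 0.6, 0.0]  # fact MF on X
B = [0.2, 1.0, 0.4]             # consequent MF on Y

w, B_prime = gmp_single(A_prime, A, B)
print(w, B_prime)
print(gmp_by_composition(A_prime, A, B))
```

Computing w first avoids building the relation matrix R, which matters when the universes are finely discretized.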

Single Rule with Multiple Antecedents 

A fuzzy if-then rule with two antecedents is usually written as “if x is A and y is 
B then z is C.” The corresponding problem for GMP is expressed as 

premise 1 (fact): x is A' and y is B' , 

premise 2 (rule): if x is A and y is B then z is C, 

consequence (conclusion): z is C'. 

Figure 3.12. Graphic interpretation of GMP using Mamdani’s fuzzy implication
and the max-min composition.

The fuzzy rule in premise 2 can be put into the simpler form “$A \times B \to C$.”
Intuitively, this fuzzy rule can be transformed into a ternary fuzzy relation $R_m$
based on Mamdani’s fuzzy implication function, as follows:

\[ R_m(A, B, C) = (A \times B) \times C = \int_{X \times Y \times Z} \mu_A(x) \wedge \mu_B(y) \wedge \mu_C(z) / (x, y, z). \]

The resulting C' is expressed as

\[ C' = (A' \times B') \circ (A \times B \to C). \]


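Thus

\[ \begin{aligned}
\mu_{C'}(z) &= \vee_{x,y} [\mu_{A'}(x) \wedge \mu_{B'}(y)] \wedge [\mu_A(x) \wedge \mu_B(y) \wedge \mu_C(z)] \\
&= \vee_{x,y} \{[\mu_{A'}(x) \wedge \mu_{B'}(y) \wedge \mu_A(x) \wedge \mu_B(y)]\} \wedge \mu_C(z) \\
&= \underbrace{\{\vee_x [\mu_{A'}(x) \wedge \mu_A(x)]\}}_{w_1} \wedge \underbrace{\{\vee_y [\mu_{B'}(y) \wedge \mu_B(y)]\}}_{w_2} \wedge \mu_C(z) \\
&= \underbrace{(w_1 \wedge w_2)}_{\text{firing strength}} \wedge \mu_C(z),
\end{aligned} \tag{3.25} \]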


where $w_1$ and $w_2$ are the maxima of the MFs of $A \cap A'$ and $B \cap B'$, respectively.
In general, $w_1$ denotes the degree of compatibility between A and A'; similarly
for $w_2$. Since the antecedent part of the fuzzy rule is constructed by the connective
“and,” $w_1 \wedge w_2$ is called the firing strength or degree of fulfillment of the
fuzzy rule, which represents the degree to which the antecedent part of the rule is
satisfied. A graphic interpretation is shown in Figure 3.13, where the MF of the
resulting C' is equal to the MF of C clipped by the firing strength w, $w = w_1 \wedge w_2$.
The generalization to more than two antecedents is straightforward.

An alternative way of calculating C' is explained in the following theorem. 

Theorem 3.1 Decomposition method for calculating C'

\[ \begin{aligned}
C' &= (A' \times B') \circ (A \times B \to C) \\
&= [A' \circ (A \to C)] \cap [B' \circ (B \to C)].
\end{aligned} \tag{3.26} \]
Figure 3.13. Approximate reasoning for multiple antecedents.


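Proof:

\[ \begin{aligned}
\mu_{C'}(z) &= \vee_{x,y} [\mu_{A'}(x) \wedge \mu_{B'}(y)] \wedge [\mu_A(x) \wedge \mu_B(y) \wedge \mu_C(z)] \\
&= \vee_x \{\mu_{A'}(x) \wedge \mu_A(x) \wedge \mu_C(z) \wedge \vee_y [\mu_{B'}(y) \wedge \mu_B(y) \wedge \mu_C(z)]\} \\
&= \vee_x \{\mu_{A'}(x) \wedge \mu_A(x) \wedge \mu_C(z) \wedge \mu_{B' \circ (B \to C)}(z)\} \\
&= \mu_{A' \circ (A \to C)}(z) \wedge \mu_{B' \circ (B \to C)}(z).
\end{aligned} \tag{3.27} \]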

□ 

The preceding theorem states that the resulting consequence C' can be expressed
as the intersection of $C'_1 = A' \circ (A \to C)$ and $C'_2 = B' \circ (B \to C)$, each of which
corresponds to the inferred fuzzy set of a GMP problem for a single fuzzy rule with
a single antecedent.
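For a rule with two antecedents, the three routes to C' described above (direct composition with the ternary relation, clipping by the firing strength, and the decomposition into two single-antecedent problems) can be compared numerically. The Python sketch below assumes Mamdani's min implication and max-min composition; the function names and membership grades are hypothetical.

```python
def match(p, q):
    """Degree of match between two MFs on the same finite universe."""
    return max(min(u, v) for u, v in zip(p, q))


def direct(a_p, b_p, a, b, c):
    """C' from the ternary relation: max over (x, y) of the min of all grades."""
    return [max(min(a_p[i], b_p[j], a[i], b[j], cz)
                for i in range(len(a)) for j in range(len(b)))
            for cz in c]


def by_firing_strength(a_p, b_p, a, b, c):
    """C' as C clipped by the firing strength w1 ^ w2."""
    w = min(match(a_p, a), match(b_p, b))
    return [min(w, cz) for cz in c]


def by_decomposition(a_p, b_p, a, b, c):
    """C' as the intersection of A' o (A -> C) and B' o (B -> C)."""
    c1 = [min(match(a_p, a), cz) for cz in c]
    c2 = [min(match(b_p, b), cz) for cz in c]
    return [min(u, v) for u, v in zip(c1, c2)]


# Hypothetical discretized fuzzy sets on X, Y, and Z.
A, B, C = [1.0, 0.5, 0.0], [1.0, 0.4, 0.0], [1.0, 0.5, 0.0]
A_prime, B_prime = [0.2, 1.0, 0.3], [0.0, 0.8, 0.7]

c_direct = direct(A_prime, B_prime, A, B, C)
c_firing = by_firing_strength(A_prime, B_prime, A, B, C)
c_decomp = by_decomposition(A_prime, B_prime, A, B, C)
print(c_direct, c_firing, c_decomp)
```

The direct route costs a double loop over X and Y for every z, whereas the other two need only one pass over each universe.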

Multiple Rules with Multiple Antecedents 

The interpretation of multiple rules is usually taken as the union of the fuzzy re- 
lations corresponding to the fuzzy rules. Therefore, for a GMP problem written 
as 

premise 1 (fact): x is A' and y is B',

premise 2 (rule 1): if x is $A_1$ and y is $B_1$ then z is $C_1$,

premise 3 (rule 2): if x is $A_2$ and y is $B_2$ then z is $C_2$,

consequence (conclusion): z is C',

we can employ the fuzzy reasoning shown in Figure 3.14 as an inference procedure
to derive the resulting output fuzzy set C'.

To verify this inference procedure, let $R_1 = A_1 \times B_1 \to C_1$ and $R_2 = A_2 \times B_2 \to C_2$.
Since the max-min composition operator $\circ$ is distributive over the $\cup$ operator,
it follows that

\[ \begin{aligned}
C' &= (A' \times B') \circ (R_1 \cup R_2) \\
&= [(A' \times B') \circ R_1] \cup [(A' \times B') \circ R_2] \\
&= C'_1 \cup C'_2,
\end{aligned} \tag{3.28} \]

where $C'_1$ and $C'_2$ are the inferred fuzzy sets for rules 1 and 2, respectively.
Figure 3.14 shows graphically the operation of fuzzy reasoning for multiple rules with
multiple antecedents.
Figure 3.14. Fuzzy reasoning for multiple rules with multiple antecedents.


When a given fuzzy rule assumes the form “if x is A or y is B then z is C,” the
firing strength is given as the maximum of the degrees of match on the antecedent part
for a given condition. This fuzzy rule is equivalent to the union of the two fuzzy
rules “if x is A then z is C” and “if y is B then z is C.”

In summary, the process of fuzzy reasoning or approximate reasoning can be 
divided into four steps: 

Degrees of compatibility: Compare the known facts with the antecedents of fuzzy
rules to find the degrees of compatibility with respect to each antecedent MF.

Firing strength: Combine degrees of compatibility with respect to antecedent MFs
in a rule using fuzzy AND or OR operators to form a firing strength that
indicates the degree to which the antecedent part of the rule is satisfied.

Qualified (induced) consequent MFs: Apply the firing strength to the
consequent MF of a rule to generate a qualified consequent MF. (The qualified
consequent MFs represent how the firing strength gets propagated and used
in a fuzzy implication statement.)

Overall output MF: Aggregate all the qualified consequent MFs to obtain an
overall output MF.

These four steps are also employed in a fuzzy inference system, which is intro- 
duced in Chapter 4. 
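The four steps can be sketched for two rules with two antecedents each. The Python code below assumes min for AND, Mamdani's min implication, and max for aggregation, in line with the max-min setting used in this section; the function names (`match`, `infer`) and every membership grade are hypothetical values made up for this illustration.

```python
def match(fact, antecedent):
    """Degree of compatibility: max over the universe of min(fact, antecedent)."""
    return max(min(f, a) for f, a in zip(fact, antecedent))


def infer(facts, rules):
    """Max-min fuzzy reasoning over a list of (antecedents, consequent) rules."""
    overall = [0.0] * len(rules[0][1])
    for antecedents, consequent in rules:
        # Step 1: degrees of compatibility, one per antecedent MF.
        degrees = [match(f, a) for f, a in zip(facts, antecedents)]
        # Step 2: firing strength (fuzzy AND interpreted as min).
        w = min(degrees)
        # Step 3: qualified consequent MF (consequent clipped at w).
        qualified = [min(w, c) for c in consequent]
        # Step 4: aggregate with max (union of the qualified MFs).
        overall = [max(o, q) for o, q in zip(overall, qualified)]
    return overall


# Rule 1: if x is A1 and y is B1 then z is C1.
A1, B1, C1 = [1.0, 0.5, 0.0], [1.0, 0.4, 0.0], [1.0, 0.5, 0.0]
# Rule 2: if x is A2 and y is B2 then z is C2.
A2, B2, C2 = [0.0, 0.5, 1.0], [0.0, 0.6, 1.0], [0.0, 0.5, 1.0]
# Facts: x is A' and y is B'.
A_prime, B_prime = [0.2, 1.0, 0.3], [0.0, 0.8, 0.7]

C_prime = infer((A_prime, B_prime), [((A1, B1), C1), ((A2, B2), C2)])
print(C_prime)
```

Here rule 1 fires with strength min(0.5, 0.4) = 0.4 and rule 2 with min(0.5, 0.7) = 0.5, so the overall output MF is the pointwise max of C1 clipped at 0.4 and C2 clipped at 0.5.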
3.5 SUMMARY

This chapter introduces the extension principle, fuzzy relations, fuzzy if-then rules, 
and fuzzy reasoning. The extension principle provides a procedure for mappings 
between fuzzy sets. Fuzzy relations and their composition rules stipulate fuzzy 
sets and their combinations and interpretations in a multidimensional space. By 
interpreting fuzzy if-then rules as fuzzy relations, various schemes of fuzzy reasoning 
(based on the concept of the compositional rule of inference) are commonly used to 
derive conclusions from a set of fuzzy if-then rules. 

Fuzzy if-then rules and fuzzy reasoning are the backbone of fuzzy inference 
systems, which are the most important modeling tool based on fuzzy set theory. 
In-depth discussion about fuzzy inference systems is provided in the next chapter; 
their adaptive version and corresponding applications are investigated in Chap- 
ters 12 and 19. 


EXERCISES 

1. Apply the extension principle to derive the MF of the fuzzy set B in Exam- 
ple 3.2 (Figure 3.2). 

2. Prove the identities in Equation (3.7) for max-min composition. 

3. Do the identities in Equation (3.7) hold for other types of composition? 

4. Carry out the calculation of $\mathcal{R}_1 \circ \mathcal{R}_2$ in Example 3.4, using both max-min and
max-product composition. Double check your results using the MATLAB file
max_star.m.

5. Repeat Example 3.6 with the new meanings of young and old defined by the
following Gaussian MFs:

\[ \mu_{\text{young}}(x) = \text{gaussian}(x, 0, 20) = e^{-\frac{1}{2}\left(\frac{x}{20}\right)^2}, \]

\[ \mu_{\text{old}}(x) = \text{gaussian}(x, 100, 30) = e^{-\frac{1}{2}\left(\frac{x - 100}{30}\right)^2}. \]

Plot the MFs for the linguistic values in Example 3.6 using MATLAB.

6. Use the MFs of young and old in Exercise 5 to generate the MFs for the following
nonprimary terms:

(a) not very young and not very old 

(b) very young or very old 

Plot the MFs for these two linguistic values using MATLAB. 

7. Increasing the magnitude (absolute value) of the b parameter of a bell MF has
an effect similar to that of the contrast intensifier in Equation (3.16); that is,
it increases the membership grades above 0.5 and diminishes those below 0.5.
Explain why.

8. Suppose that the MFs for fuzzy sets A and B are a trapezoidal MF trapezoid(x; a, b, c, d)
and a two-sided π-MF ts-π(x; a, b, c, d) [Equation (2.68)], respectively. Show
that INT(A) = B.

9. Find an operator contrast diminisher DIM that is the inverse of contrast 
intensifier INT. Namely, for any fuzzy set A , DIM should satisfy the following: 

DIM(INT(A)) = A. 

Repeat the plot in Figure 3.6 but use the contrast diminisher instead. 

10. Verify that Equations (3.19) to (3.22) reduce to the familiar identity $A \to B =
\neg A \cup B$ when A and B are propositions in the sense of two-valued logic.

11. Repeat the plot of the fuzzy relations in the second rows of Figures 3.8 and 3.9, 
assuming A and B are defined by 

(a) $\mu_A(x) = \text{triangle}(x, 5, 10, 15)$ and $\mu_B(y) = \text{triangle}(y, 5, 10, 15)$.

(b) $\mu_A(x) = \text{trapezoid}(x, 3, 8, 12, 17)$ and $\mu_B(y) = \text{trapezoid}(y, 3, 8, 12, 17)$.


12. Another fuzzy implication function based on the interpretation “A entails B”
can be expressed as

\[ R_s = A \to B = \int_{X \times Y} \text{sgn}[\mu_B(y) - \mu_A(x)] / (x, y), \]

or, alternatively,

\[ f_s(a, b) = \text{sgn}(b - a) = \begin{cases} 1 & \text{if } a \le b, \\ 0 & \text{otherwise,} \end{cases} \]

where $a = \mu_A(x)$ and $b = \mu_B(y)$. Plot $z = f_s(a, b)$ and $\mu_{R_s}(x, y) = \text{sgn}[\mu_B(y) -
\mu_A(x)]$, assuming the MFs for A and B are $\mu_A(x) = \text{bell}(x, 4, 3, 10)$ and
$\mu_B(y) = \text{bell}(y, 4, 3, 10)$, respectively.

13. Use the identities in the previous exercise to show that the fuzzy rule “if x is
A or y is B then z is C” is equivalent to the union of the two fuzzy rules “if x
is A then z is C” and “if y is B then z is C” under max-min composition.


REFERENCES 


[1] S. Fukami, M. Mizumoto, and K. Tanaka. Some considerations on fuzzy conditional 
inference. Fuzzy Sets and Systems, 4:243-273, 1980. 





[2] P. M. Larsen. Industrial applications of fuzzy logic control. International Journal of
Man-Machine Studies, 12(1):3-10, 1980.

[3] E. H. Mamdani and S. Assilian. An experiment in linguistic synthesis with a fuzzy
logic controller. International Journal of Man-Machine Studies, 7(1):1-13, 1975.

[4] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338-353, 1965. 

[5] L. A. Zadeh. Quantitative fuzzy semantics. Information Sciences, 3:159-176, 1971. 

[6] L. A. Zadeh. Similarity relations and fuzzy ordering. Information Sciences, 3:177-206, 
1971. 

[7] L. A. Zadeh. Outline of a new approach to the analysis of complex systems and decision
processes. IEEE Transactions on Systems, Man, and Cybernetics, 3(1):28-44, January
1973.

[8] L. A. Zadeh. The concept of a linguistic variable and its application to approximate 
reasoning, Parts 1, 2 and 3. Information Sciences, 8:199-249, 8:301-357, 9:43-80, 1975. 



Chapter 4 


Fuzzy Inference Systems 


J.-S. R. Jang 

4.1 INTRODUCTION 

The fuzzy inference system is a popular computing framework based on the 
concepts of fuzzy set theory, fuzzy if-then rules, and fuzzy reasoning. It has found 
successful applications in a wide variety of fields, such as automatic control, data 
classification, decision analysis, expert systems, time series prediction, robotics, and 
pattern recognition. Because of its multidisciplinary nature, the fuzzy inference 
system is known by numerous other names, such as fuzzy-rule-based system, 
fuzzy expert system [2], fuzzy model [10, 9], fuzzy associative memory [3], 
fuzzy logic controller [6, 4, 5], and simply (and ambiguously) fuzzy system. 

The basic structure of a fuzzy inference system consists of three conceptual 
components: a rule base, which contains a selection of fuzzy rules; a database 
(or dictionary), which defines the membership functions used in the fuzzy rules; 
and a reasoning mechanism, which performs the inference procedure (usually 
the fuzzy reasoning introduced in Section 3.4.2) upon the rules and given facts to 
derive a reasonable output or conclusion. 

Note that the basic fuzzy inference system can take either fuzzy inputs or crisp 
inputs (which are viewed as fuzzy singletons), but the outputs it produces are al- 
most always fuzzy sets. Sometimes it is necessary to have a crisp output, especially 
in a situation where a fuzzy inference system is used as a controller. Therefore, 
we need a method of defuzzification to extract a crisp value that best represents 
a fuzzy set. A fuzzy inference system with a crisp output is shown in Figure 4.1, 
where the dashed line indicates a basic fuzzy inference system with fuzzy output 
and the defuzzification block serves the purpose of transforming an output fuzzy 
set into a crisp single value. An example of a fuzzy inference system without de- 
fuzzification block is the two-rule two-input system of Figure 3.14. The function of 
the defuzzification block is explained in Section 4.2. 

With crisp inputs and outputs, a fuzzy inference system implements a nonlinear
mapping from its input space to output space. This mapping is accomplished by
a number of fuzzy if-then rules, each of which describes the local behavior of the
mapping. In particular, the antecedent of a rule defines a fuzzy region in the input
space, while the consequent specifies the output in the fuzzy region.

Figure 4.1. Block diagram for a fuzzy inference system.

In what follows, we shall first introduce three types of fuzzy inference systems 
that have been widely employed in various applications. The differences between 
these three fuzzy inference systems lie in the consequents of their fuzzy rules, and 
thus their aggregation and defuzzification procedures differ accordingly. Then we 
will introduce and compare three different ways of partitioning the input space; these 
partitioning methods can be adopted by any fuzzy inference system, regardless of the 
structure of the consequents of its rules. Finally, we will address briefly the features 
and the problems of fuzzy modeling, which is concerned with the construction of 
fuzzy inference systems for modeling a given target system. 


4.2 MAMDANI FUZZY MODELS 

The Mamdani fuzzy inference system [6] was proposed as the first attempt 
to control a steam engine and boiler combination by a set of linguistic control 
rules obtained from experienced human operators. Figure 4.2 is an illustration of 
how a two-rule Mamdani fuzzy inference system derives the overall output z when
subjected to two crisp inputs x and y. 

If we adopt algebraic product and max as our choice for the T-norm and
T-conorm operators, respectively, and use max-product composition instead of the
original max-min composition, then the resulting fuzzy reasoning is shown in Figure 4.3,
where the inferred output of each rule is a fuzzy set scaled down by its
firing strength via algebraic product. Although this type of fuzzy reasoning was
not employed in Mamdani’s original paper, it has often been used in the literature.
Other variations are possible if we use different T-norm and T-conorm operators.

Figure 4.2. The Mamdani fuzzy inference system using min and max for T-norm
and T-conorm operators, respectively.

In Mamdani’s application [6], two fuzzy inference systems were used as two
controllers to generate the heat input to the boiler and throttle opening of the
engine cylinder, respectively, to regulate the steam pressure in the boiler and the
speed of the engine. Since the plant takes only crisp values as inputs, we have to
use a defuzzifier to convert a fuzzy set to a crisp value.

Defuzzification 

Defuzzification refers to the way a crisp value is extracted from a fuzzy set as 
a representative value. In general, there are five methods for defuzzifying a fuzzy 
set A of a universe of discourse Z, as shown in Figure 4.4. (Here the fuzzy set A is 
usually represented by an aggregated output MF, such as C' in Figures 4.2 and 4.3.) 
A brief explanation of each defuzzification strategy follows. 

• Centroid of area $z_{\text{COA}}$:

$$z_{\text{COA}} = \frac{\int_Z \mu_A(z)\, z\, dz}{\int_Z \mu_A(z)\, dz}, \tag{4.1}$$

where $\mu_A(z)$ is the aggregated output MF. This is the most widely adopted
defuzzification strategy, which is reminiscent of the calculation of expected
values of probability distributions.
Figure 4.3. The Mamdani fuzzy inference system using product and max for T-norm
and T-conorm operators, respectively.


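• Bisector of area $z_{\text{BOA}}$: $z_{\text{BOA}}$ satisfies

$$\int_{\alpha}^{z_{\text{BOA}}} \mu_A(z)\, dz = \int_{z_{\text{BOA}}}^{\beta} \mu_A(z)\, dz, \tag{4.2}$$

where $\alpha = \min\{z \mid z \in Z\}$ and $\beta = \max\{z \mid z \in Z\}$. That is, the vertical line
$z = z_{\text{BOA}}$ partitions the region between $z = \alpha$, $z = \beta$, $y = 0$, and $y = \mu_A(z)$
into two regions with the same area.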


• Mean of maximum $z_{\text{MOM}}$: $z_{\text{MOM}}$ is the average of the maximizing $z$ at
which the MF reaches a maximum $\mu^*$. In symbols,

$$z_{\text{MOM}} = \frac{\int_{Z'} z\, dz}{\int_{Z'} dz}, \tag{4.3}$$

where $Z' = \{z \mid \mu_A(z) = \mu^*\}$. In particular, if $\mu_A(z)$ has a single maximum at
$z = z^*$, then $z_{\text{MOM}} = z^*$. Moreover, if $\mu_A(z)$ reaches its maximum whenever
$z \in [z_{\text{left}}, z_{\text{right}}]$ (this is the case in Figure 4.4), then
$z_{\text{MOM}} = (z_{\text{left}} + z_{\text{right}})/2$. The mean of maximum is the defuzzification
strategy employed in Mamdani’s fuzzy logic controllers [6].
Figure 4.4. Various defuzzification schemes for obtaining a crisp output.


• Smallest of maximum $z_{\text{SOM}}$: $z_{\text{SOM}}$ is the minimum (in terms of magnitude)
of the maximizing $z$.

• Largest of maximum $z_{\text{LOM}}$: $z_{\text{LOM}}$ is the maximum (in terms of magnitude)
of the maximizing $z$. Because of their obvious bias, $z_{\text{SOM}}$ and $z_{\text{LOM}}$ are not
used as often as the other three defuzzification methods.
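
On a sampled universe of discourse, the five strategies reduce to sums and comparisons. The following Python sketch is illustrative only: the helper names (`trapezoid_mf`, `defuzzify`), the equally spaced sampling, and the example trapezoid are assumptions introduced here, and the integrals in Equations (4.1) to (4.3) are approximated by Riemann sums.

```python
def trapezoid_mf(z, a, b, c, d):
    """Trapezoidal MF with feet a, d and shoulders b, c (a < b <= c < d)."""
    if z <= a or z >= d:
        return 0.0
    if z < b:
        return (z - a) / (b - a)
    if z <= c:
        return 1.0
    return (d - z) / (d - c)


def defuzzify(zs, mus, method="centroid"):
    """Defuzzify a sampled MF; zs must be sorted and equally spaced."""
    total = sum(mus)
    if total == 0:
        raise ValueError("all membership grades are zero")
    if method == "centroid":
        return sum(z * m for z, m in zip(zs, mus)) / total
    if method == "bisector":
        running = 0.0
        for z, m in zip(zs, mus):
            running += m
            if running >= total / 2:      # first sample that splits the area
                return z
    peak = max(mus)
    maximizers = [z for z, m in zip(zs, mus) if abs(m - peak) < 1e-12]
    if method == "mom":
        return sum(maximizers) / len(maximizers)
    if method == "som":
        return min(maximizers, key=abs)   # smallest in magnitude
    if method == "lom":
        return max(maximizers, key=abs)   # largest in magnitude
    raise ValueError("unknown method: " + method)


zs = [i / 100 for i in range(1001)]               # universe [0, 10], step 0.01
mus = [trapezoid_mf(z, 1, 3, 5, 9) for z in zs]   # hypothetical output MF
for name in ("centroid", "bisector", "mom", "som", "lom"):
    print(name, round(defuzzify(zs, mus, name), 3))
```

For this particular trapezoid the exact values are 4.6 (centroid), 4.5 (bisector), 4 (mean of maximum), 3 (smallest of maximum), and 5 (largest of maximum); the sampled version approximates them to within the grid resolution.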

The calculation needed to carry out any of these five defuzzification operations 
is time-consuming unless special hardware support is available. Furthermore, these 
defuzzification operations are not easily subject to rigorous mathematical analysis, 
so most of the studies are based on experimental results. This leads to the proposi- 
tions of other types of fuzzy inference systems that do not need defuzzification at all; 
two of them are introduced in the next section. Other more flexible defuzzification 
methods can be found in [7, 8, 12]. 

The following two examples are single-input and two-input Mamdani fuzzy models.

Example 4.1 Single-input single-output Mamdani fuzzy model 

An example of a single-input single-output Mamdani fuzzy model with three rules 
can be expressed as 


If X is small then Y is small.

If X is medium then Y is medium.

If X is large then Y is large.

Figure 4.5(a) plots the membership functions of input X and output Y, where
the input and output universes are [-10, 10] and [0, 10], respectively. With max-min
composition and centroid defuzzification, we can find the overall input-output
curve, as shown in Figure 4.5(b). Note that the output variable never reaches the
maximum (10) and minimum (0) of the output universe. Instead, the reachable
minimum and maximum of the output variable are determined by the centroids of
the leftmost and rightmost consequent MFs, respectively.
Figure 4.5. Single-input single-output Mamdani fuzzy model in Example 4.1: (a)
antecedent and consequent MFs; (b) overall input-output curve. (MATLAB file:
mam1.m)


□ 
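
The input-output curve of such a three-rule model can be approximated numerically. The Python sketch below assumes triangular MFs with made-up parameters (the shapes in Figure 4.5(a) are not reproduced here), min for implication, max for aggregation, and a sampled centroid; every name in it is illustrative.

```python
def trimf(x, a, b, c):
    """Triangular MF with feet a, c and peak b (a < b < c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


# Hypothetical MF parameters (a, b, c); not the ones used in Figure 4.5(a).
INPUT_MFS = {"small": (-20, -10, 0), "medium": (-10, 0, 10), "large": (0, 10, 20)}
OUTPUT_MFS = {"small": (0, 2, 4), "medium": (3, 5, 7), "large": (6, 8, 10)}
RULES = [("small", "small"), ("medium", "medium"), ("large", "large")]
ZS = [i / 100 for i in range(1001)]       # sampled output universe [0, 10]


def mamdani(x):
    """Crisp output for a crisp input x in [-10, 10]."""
    overall = [0.0] * len(ZS)
    for antecedent, consequent in RULES:
        w = trimf(x, *INPUT_MFS[antecedent])                       # firing strength
        for k, z in enumerate(ZS):
            qualified = min(w, trimf(z, *OUTPUT_MFS[consequent]))  # min implication
            overall[k] = max(overall[k], qualified)                # max aggregation
    return sum(z * m for z, m in zip(ZS, overall)) / sum(overall)  # centroid


curve = [(x, mamdani(x)) for x in range(-10, 11, 5)]
```

With these assumed MFs the computed outputs run from 2 to 8, the centroids of the leftmost and rightmost consequent MFs, even though the output universe is [0, 10].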


Example 4.2 Two-input single-output Mamdani fuzzy model 

An example of a two-input single-output Mamdani fuzzy model with four rules can 
be expressed as 


If X is small and Y is small then Z is negative large.

If X is small and Y is large then Z is negative small.

If X is large and Y is small then Z is positive small.

If X is large and Y is large then Z is positive large.

Figure 4.6(a) plots the membership functions of inputs X and Y and output Z, all
with the same universe [-5, 5]. With max-min composition and centroid defuzzification,
we can find the overall input-output surface, as shown in Figure 4.6(b).
For a multiple-input fuzzy model, it is sometimes helpful to have a tool for viewing
the process of fuzzy inference; Figure 4.7 is the fuzzy inference viewer available in
the Fuzzy Logic Toolbox, where you can change the input values by clicking and dragging
the input vertical lines and then see the interactive changes of qualified consequent
MFs and overall output MF.


□ 







Figure 4.6. Two-input single-output Mamdani fuzzy model in Example 4.2: (a)
antecedent and consequent MFs; (b) overall input-output surface. (MATLAB file:
mam2.m)

4.2.1 Other Variants 

Figures 4.2 and 4.3 conform to the fuzzy reasoning defined previously. However, 
in consideration of computation efficiency or mathematical tractability, a fuzzy in- 
ference system in practice may have a certain reasoning mechanism that does not 
follow the strict definition of the compositional rule of inference. For instance, 
one might use product for computing firing strengths (for rules with AND’ed an- 
tecedent), min for computing qualified consequent MFs, and max for aggregating 
them into an overall output MF. Therefore, to completely specify the operation of 
a Mamdani fuzzy inference system, we need to assign a function for each of the 
following operators: 

• AND operator (usually T-norm) for calculating the firing strength of a rule 
with AND’ed antecedents. 

• OR operator (usually T-conorm) for calculating the firing strength of a rule 
with OR’ed antecedents. 

• Implication operator (usually T-norm) for calculating qualified consequent 
MFs based on given firing strength. 

• Aggregate operator (usually T-conorm) for aggregating qualified conse- 
quent MFs to generate an overall output MF. 


• Defuzzification operator for transforming an output MF to a crisp single
output value.
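
These operator choices can be treated as interchangeable functions. The following Python sketch uses hypothetical names (`tri`, `infer`) and a made-up two-rule system; it handles only AND'ed antecedents, so the OR operator is left out. It compares min with product for the AND operator while keeping min for implication and max for aggregation.

```python
def tri(a, b, c):
    """Return a triangular MF with feet a, c and peak b (a < b < c)."""
    return lambda x: max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


def centroid(zs, mus):
    """Sampled centroid defuzzification."""
    return sum(z * m for z, m in zip(zs, mus)) / sum(mus)


def infer(rules, inputs, zs, and_op, implication, aggregate, defuzz):
    """Evaluate a Mamdani system whose rules have AND'ed antecedents only."""
    overall = [0.0] * len(zs)
    for antecedent_mfs, consequent_mf in rules:
        grades = [mf(x) for mf, x in zip(antecedent_mfs, inputs)]
        w = grades[0]
        for g in grades[1:]:
            w = and_op(w, g)                              # firing strength
        for k, z in enumerate(zs):
            qualified = implication(w, consequent_mf(z))  # qualified consequent
            overall[k] = aggregate(overall[k], qualified)
    return defuzz(zs, overall)


def product(a, b):
    return a * b


# Hypothetical two-input, two-rule system.
small, large = tri(-15, -5, 5), tri(-5, 5, 15)
low, high = tri(0, 2, 4), tri(6, 8, 10)
rules = [((small, small), low), ((large, large), high)]
zs = [i / 100 for i in range(1001)]       # sampled output universe [0, 10]

z_min = infer(rules, (-2, -2), zs, min, min, max, centroid)
z_prod = infer(rules, (-2, -2), zs, product, min, max, centroid)
```

The two calls differ only in the AND operator, yet they yield clearly different crisp outputs, which is why all five operators have to be fixed before a Mamdani system is fully specified.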
Figure 4.7. Fuzzy inference viewer in the Fuzzy Logic Toolbox. This is obtained
by typing fuzzy mam21 within MATLAB.


One such example is to use product for the implication operator and point- 
wise summation (sum) for the aggregate operator. (Note that sum is not even a 
T-conorm operator.) An advantage of this sum-product composition [3] is that 
the final crisp output via centroid defuzzification is equal to the weighted average of 
the centroids of consequent MFs, where the weighting factor for each rule is equal 
to its firing strength multiplied by the area of the consequent MF. This is expressed 
as the following theorem. 


Theorem 4.1 Computation shortcut for Mamdani fuzzy inference systems 


Under sum-product composition, the output of a Mamdani fuzzy inference system 
with centroid defuzzification is equal to the weighted average of the centroids of 
consequent MFs, where each of the weighting factors is equal to the product of a 
firing strength and the consequent MF’s area. 

Proof: We shall prove this theorem for a fuzzy inference system with two rules (see
Figure 4.3). By using product and sum for implication and aggregate operators,
respectively, we have

$$\mu_{C'}(z) = w_1 \mu_{C_1}(z) + w_2 \mu_{C_2}(z).$$

(Note that the preceding MF could have values greater than 1 at certain points.)
The crisp output under centroid defuzzification is

$$
\begin{aligned}
z_{\text{COA}} &= \frac{\int_Z \mu_{C'}(z)\, z\, dz}{\int_Z \mu_{C'}(z)\, dz} \\
&= \frac{w_1 \int_Z \mu_{C_1}(z)\, z\, dz + w_2 \int_Z \mu_{C_2}(z)\, z\, dz}{w_1 \int_Z \mu_{C_1}(z)\, dz + w_2 \int_Z \mu_{C_2}(z)\, dz} \\
&= \frac{w_1 a_1 z_1 + w_2 a_2 z_2}{w_1 a_1 + w_2 a_2},
\end{aligned}
$$

where $a_i = \int_Z \mu_{C_i}(z)\, dz$ and $z_i = \int_Z \mu_{C_i}(z)\, z\, dz \big/ \int_Z \mu_{C_i}(z)\, dz$ are the area and centroid of the
consequent MF $\mu_{C_i}(z)$, respectively.

□

Figure 4.8. The Sugeno fuzzy model. [Panel labels: Product; $z_1 = p_1 x + q_1 y + r_1$; $z_2 = p_2 x + q_2 y + r_2$; Weighted Average: $z = (w_1 z_1 + w_2 z_2)/(w_1 + w_2)$.]

By using this theorem, computation is more efficient if we can obtain the area
and centroid of each consequent MF in advance.
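
Theorem 4.1 can be checked numerically. The Python sketch below compares a direct centroid computation of $w_1 \mu_{C_1}(z) + w_2 \mu_{C_2}(z)$ with the weighted-average shortcut. The triangular consequent MFs, the firing strengths, and the function names are assumptions chosen for illustration; the area $(c - a)/2$ and centroid $(a + b + c)/3$ are the standard formulas for a unit-height triangle.

```python
def trimf(z, a, b, c):
    """Triangular MF with feet a, c and peak b (a < b < c)."""
    return max(min((z - a) / (b - a), (c - z) / (c - b)), 0.0)


def centroid_direct(ws, mfs, lo, hi, n=20000):
    """Centroid of sum_i w_i * mu_Ci(z), integrated by the midpoint rule."""
    h = (hi - lo) / n
    num = den = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * h
        mu = sum(w * trimf(z, *p) for w, p in zip(ws, mfs))  # sum-product MF
        num += mu * z
        den += mu
    return num / den


def centroid_shortcut(ws, mfs):
    """Weighted average of centroids with weights w_i * a_i (Theorem 4.1)."""
    areas = [(c - a) / 2 for a, b, c in mfs]           # a_i, precomputable
    centroids = [(a + b + c) / 3 for a, b, c in mfs]   # z_i, precomputable
    num = sum(w * a * z for w, a, z in zip(ws, areas, centroids))
    den = sum(w * a for w, a in zip(ws, areas))
    return num / den


ws = [0.3, 0.8]                      # hypothetical firing strengths
mfs = [(0, 2, 6), (4, 7, 10)]        # hypothetical consequent MFs
direct = centroid_direct(ws, mfs, 0, 10)
shortcut = centroid_shortcut(ws, mfs)
```

The shortcut needs only a handful of multiplications per inference once the areas and centroids are stored, whereas the direct route evaluates every consequent MF at every sample point.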

4.3 SUGENO FUZZY MODELS 

The Sugeno fuzzy model (also known as the TSK fuzzy model) was proposed 
by Takagi, Sugeno, and Kang [10, 9] in an effort to develop a systematic approach 
to generating fuzzy rules from a given input-output data set. A typical fuzzy rule 
in a Sugeno fuzzy model has the form 

if x is A and y is B then z = f(x, y),

where A and B are fuzzy sets in the antecedent, while z = f(x, y) is a crisp function
in the consequent. Usually f(x, y) is a polynomial in the input variables x and y,
but it can be any function as long as it can appropriately describe the output of
the model within the fuzzy region specified by the antecedent of the rule. When
f(x, y) is a first-order polynomial, the resulting fuzzy inference system is called
a first-order Sugeno fuzzy model, which was originally proposed in [10, 9].
When f is a constant, we then have a zero-order Sugeno fuzzy model, which
can be viewed either as a special case of the Mamdani fuzzy inference system, in
which each rule’s consequent is specified by a fuzzy singleton (or a pre-defuzzified
consequent), or a special case of the Tsukamoto fuzzy model (to be introduced
next), in which each rule’s consequent is specified by an MF of a step function
centered at the constant. Moreover, a zero-order Sugeno fuzzy model is functionally
equivalent to a radial basis function network under certain minor constraints [1], as
will be detailed in Chapter 12.

The output of a zero-order Sugeno model is a smooth function of its input 
variables as long as the neighboring MFs in the antecedent have enough overlap. In 
other words, the overlap of MFs in the consequent of a Mamdani model does not 
have a decisive effect on the smoothness; it is the overlap of the antecedent MFs 
that determines the smoothness of the resulting input-output behavior. 

Figure 4.8 shows the fuzzy reasoning procedure for a first-order Sugeno fuzzy
model. Since each rule has a crisp output, the overall output is obtained via
weighted average, thus avoiding the time-consuming process of defuzzification
required in a Mamdani model. In practice, the weighted average operator is sometimes
replaced with the weighted sum operator (that is, $z = w_1 z_1 + w_2 z_2$ in
Figure 4.8) to reduce computation further, especially in the training of a fuzzy inference
system. However, this simplification could lead to the loss of MF linguistic
meanings unless the sum of firing strengths (that is, $\sum_i w_i$) is close to unity.

Since the only fuzzy part of a Sugeno model is in its antecedent, it is easy to 
demonstrate the distinction between a set of fuzzy rules and nonfuzzy ones. 

Example 4.3 Fuzzy and nonfuzzy rule set — a comparison 

An example of a single-input Sugeno fuzzy model can be expressed as 

If X is small then Y = 0.1X + 6.4.

If X is medium then Y = -0.5X + 4.

If X is large then Y = X - 2.

If “small,” “medium,” and “large” are nonfuzzy sets with membership functions
shown in Figure 4.9(a), then the overall input-output curve is piecewise linear, as
shown in Figure 4.9(b). On the other hand, if we have smooth membership functions
[Figure 4.9(c)] instead, the overall input-output curve [Figure 4.9(d)] becomes a
smoother one.


□ 
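
The contrast between the two rule sets can be sketched in a few lines of Python. The three consequent equations are those of Example 4.3; everything else is an assumption made for illustration: the crisp boundaries at -3 and 3, the sigmoid-based smooth MFs and their slopes, and the function names. The MFs of Figure 4.9 are not reproduced.

```python
import math


def sig(x, a, c):
    """Sigmoidal MF with slope a and crossover point c."""
    return 1.0 / (1.0 + math.exp(-a * (x - c)))


# (slope, intercept) of each rule's consequent, from Example 4.3.
RULES = [(0.1, 6.4), (-0.5, 4.0), (1.0, -2.0)]


def crisp_mfs(x):
    """Nonfuzzy small/medium/large with assumed boundaries at -3 and 3."""
    return [1.0 if x < -3 else 0.0,
            1.0 if -3 <= x <= 3 else 0.0,
            1.0 if x > 3 else 0.0]


def smooth_mfs(x):
    """Assumed smooth small/medium/large built from sigmoids."""
    return [sig(x, -2, -3), sig(x, 2, -3) - sig(x, 2, 3), sig(x, 2, 3)]


def sugeno(x, mfs):
    """Weighted average of the rule outputs, weighted by firing strength."""
    ws = mfs(x)
    outputs = [p * x + r for p, r in RULES]
    return sum(w * f for w, f in zip(ws, outputs)) / sum(ws)


jump_crisp = abs(sugeno(-3.001, crisp_mfs) - sugeno(-2.999, crisp_mfs))
jump_smooth = abs(sugeno(-3.001, smooth_mfs) - sugeno(-2.999, smooth_mfs))
```

With the crisp sets the output switches abruptly from one line to the next at the assumed boundary, while the smooth MFs blend neighboring rule outputs, so `jump_smooth` is far smaller than `jump_crisp`.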

Sometimes a simple Sugeno fuzzy model can generate complex behavior. The 
following is an example of a two-input system. 
Figure 4.9. Comparison between fuzzy and nonfuzzy rules in Example 4.3: (a)
antecedent MFs and (b) input-output curve for nonfuzzy rules; (c) antecedent MFs
and (d) input-output curve for fuzzy rules. (MATLAB file: sug1.m)

Example 4.4 Two-input single-output Sugeno fuzzy model 

An example of a two-input single-output Sugeno fuzzy model with four rules can 
be expressed as 

If X is small and Y is small then z = -x + y + 1.

If X is small and Y is large then z = -y + 3.

If X is large and Y is small then z = -x + 3.

If X is large and Y is large then z = x + y + 2.

Figure 4.10(a) plots the membership functions of input X and Y, and Figure 4.10(b) 
is the resulting input-output surface. The surface is complex, but it is still obvious 
that the surface is composed of four planes, each of which is specified by the output 
equation of a fuzzy rule. 


Figure 4.10. Two-input single-output Sugeno fuzzy model in Example 4.4: (a)
antecedent and consequent MFs; (b) overall input-output surface. (MATLAB file:
sug2.m)

Unlike the Mamdani fuzzy model, the Sugeno fuzzy model cannot follow the
compositional rule of inference (Section 3.4.1) strictly in its fuzzy reasoning mechanism.
This poses some difficulties when the inputs to a Sugeno fuzzy model are
fuzzy. Specifically, we can still employ the matching of fuzzy sets, as shown in the
antecedent part of Figure 3.14, to find the firing strength of each rule. However,
the resulting overall output via either weighted average or weighted sum is always
crisp; this is counterintuitive since a fuzzy model should be able to propagate the
fuzziness from inputs to outputs in an appropriate manner.
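
The following Python sketch illustrates the point for a zero-order model. It assumes that matching a fuzzy input A' against an antecedent A means taking the maximum over x of min(μ_A'(x), μ_A(x)) on a sampled universe; the Gaussian MFs, the constant consequents, and all names are hypothetical.

```python
import math


def gaussian(c, sigma):
    """Return a Gaussian MF centered at c with width sigma."""
    return lambda x: math.exp(-0.5 * ((x - c) / sigma) ** 2)


def firing_strength(fact_mf, antecedent_mf, xs):
    """Max-min matching of a fuzzy input against an antecedent MF."""
    return max(min(fact_mf(x), antecedent_mf(x)) for x in xs)


xs = [i / 100 for i in range(-500, 501)]       # sampled input universe [-5, 5]
small, large = gaussian(-2, 1), gaussian(2, 1)
about_zero = gaussian(0, 1)                    # fuzzy input "about 0"

# Zero-order rules: if x is small then z = 1; if x is large then z = 3.
strengths = [firing_strength(about_zero, small, xs),
             firing_strength(about_zero, large, xs)]
constants = [1.0, 3.0]
z = sum(w * c for w, c in zip(strengths, constants)) / sum(strengths)
```

Although the input is a fuzzy set, `z` is a single number: the width of the input affects the firing strengths but is not carried through to the output.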

Without the time-consuming and mathematically intractable defuzzification op- 
eration, the Sugeno fuzzy model is by far the most popular candidate for sample- 
data-based fuzzy modeling, which is introduced in Chapter 12. 

4.4 TSUKAMOTO FUZZY MODELS 

In the Tsukamoto fuzzy models [11], the consequent of each fuzzy if-then rule 
is represented by a fuzzy set with a monotonic MF, as shown in Figure 4.11. As
a result, the inferred output of each rule is defined as a crisp value induced by 
the rule’s firing strength. The overall output is taken as the weighted average of 
each rule’s output. Figure 4.11 illustrates the reasoning procedure for a two-input 
two-rule system. 

Since each rule infers a crisp output, the Tsukamoto fuzzy model aggregates each 
rule’s output by the method of weighted average and thus avoids the time-consuming 
process of defuzzification. However, the Tsukamoto fuzzy model is not used often 
since it is not as transparent as either the Mamdani or Sugeno fuzzy models. The 
following is a single-input example. 


Figure 4.11. The Tsukamoto fuzzy model.

Example 4.5 Single-input Tsukamoto fuzzy model

An example of a single-input Tsukamoto fuzzy model can be expressed as

If X is small then Y is $C_1$

If X is medium then Y is $C_2$

If X is large then Y is $C_3$,

where the antecedent MFs for “small,” “medium,” and “large” are shown in Figure 4.12(a),
and the consequent MFs for “$C_1$,” “$C_2$,” and “$C_3$” are shown in Figure 4.12(b).
The overall input-output curve, as shown in Figure 4.12(d), is equal
to $(\sum_{i=1}^{3} w_i f_i)/(\sum_{i=1}^{3} w_i)$, where $f_i$ is the output of each rule induced by the firing
strength $w_i$ and the MF for $C_i$. If we plot each rule’s output $f_i$ as a function of x, we
obtain Figure 4.12(c), which is not quite obvious from the original rule base and
MF plots.


□ 

Since the reasoning mechanism of the Tsukamoto fuzzy model does not follow 
strictly the compositional rule of inference, the output is always crisp even when 
the inputs are fuzzy. 
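
A minimal Python sketch of this reasoning procedure follows. It assumes triangular antecedent MFs and linear monotonic consequent MFs with made-up parameters, so that each rule output can be obtained by inverting the consequent MF in closed form; none of these shapes or names come from Figure 4.12.

```python
def trimf(x, a, b, c):
    """Triangular MF with feet a, c and peak b (a < b < c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


# Hypothetical antecedent MFs (small, medium, large) on the universe [-10, 10].
ANTECEDENTS = [(-20, -10, 0), (-10, 0, 10), (0, 10, 20)]
# Hypothetical monotonic consequent MFs C1, C2, C3, each linear between
# z0 (membership grade 0) and z1 (membership grade 1).
CONSEQUENTS = [(5.0, 0.0), (0.0, 5.0), (5.0, 10.0)]


def rule_output(w, z0, z1):
    """Invert the monotonic MF: the z whose membership grade equals w."""
    return z0 + w * (z1 - z0)


def tsukamoto(x):
    """Weighted average of the crisp rule outputs induced by firing strengths."""
    ws = [trimf(x, *p) for p in ANTECEDENTS]
    fs = [rule_output(w, *c) for w, c in zip(ws, CONSEQUENTS)]
    return sum(w * f for w, f in zip(ws, fs)) / sum(ws)


curve = [(x, tsukamoto(x)) for x in range(-10, 11, 5)]
```

Because each rule output depends on the firing strength through the inverse of its consequent MF, the individual rule curves are harder to read off the rule base than in the Mamdani or Sugeno cases.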


4.5 OTHER CONSIDERATIONS 

There are certain common issues concerning all three fuzzy inference systems
introduced previously, such as how to partition an input space and how to construct 
a fuzzy inference system for a particular application. We shall examine these issues 
in this section. 
Figure 4.12. Single-input single-output Tsukamoto fuzzy model in Example 4.5:
(a) antecedent MFs; (b) consequent MFs; (c) each rule’s output curve; (d) overall
input-output curve. (MATLAB file: tsu1.m)


4.5.1 Input Space Partitioning 

Now it should be clear that the spirit of fuzzy inference systems resembles that 
of “divide and conquer” —the antecedent of a fuzzy rule defines a local fuzzy re- 
gion, while the consequent describes the behavior within the region via various 
constituents. The consequent constituent can be a consequent MF (Mamdani and 
Tsukamoto fuzzy models), a constant value (zero-order Sugeno model), or a lin- 
ear equation (first-order Sugeno model). Different consequent constituents result in 
different fuzzy inference systems, but their antecedents are always the same. There- 
fore, the following discussion of methods of partitioning input spaces to form the 
antecedents of fuzzy rules is applicable to all three types of fuzzy inference systems. 

Figure 4.13. Various methods for partitioning the input space: (a) grid partition;
(b) tree partition; (c) scatter partition.

• Grid partition: Figure 4.13(a) illustrates a typical grid partition in a two-dimensional
input space. This partition method is often chosen in designing
a fuzzy controller, which usually involves only several state variables as the
inputs to the controller. This partition strategy needs only a small number
of MFs for each input. However, it encounters problems when we have a
moderately large number of inputs. For instance, a fuzzy model with 10
inputs and 2 MFs on each input would result in $2^{10} = 1024$ fuzzy if-then
rules, which is prohibitively large. This problem, usually referred to as the
curse of dimensionality, can be alleviated by the other partition strategies.

• Tree partition: Figure 4.13(b) shows a typical tree partition, in which each 
region can be uniquely specified along a corresponding decision tree. The 
tree partition relieves the problem of an exponential increase in the number 
of rules. However, more MFs for each input are needed to define these fuzzy 
regions, and these MFs do not usually bear clear linguistic meanings such as 
“small,” “big,” and so on. In other words, orthogonality holds roughly in
$X \times Y$, but not in either X or Y alone. Tree partition is used by the CART
(classification and regression tree) algorithm, as discussed in Chapter 14.

• Scatter partition: As shown in Figure 4.13(c), by covering a subset of 
the whole input space that characterizes a region of possible occurrence of 
the input vectors, the scatter partition can also limit the number of rules to 
a reasonable amount. However, the scatter partition is usually dictated by 
desired input-output data pairs and thus, in general, orthogonality does not 
hold in X, Y, or $X \times Y$. This makes it hard to estimate the overall mapping
directly from the consequent of each rule’s output. 
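
The growth in the number of rules under grid partition is easy to reproduce. In the Python sketch below, the function names and the linguistic labels are illustrative assumptions; a grid partition simply pairs every MF of one input with every MF of the others.

```python
from itertools import product
from math import prod


def grid_rule_count(mfs_per_input):
    """Number of rules when every combination of input MFs gets its own rule."""
    return prod(mfs_per_input)


def grid_antecedents(labels_per_input):
    """Enumerate the antecedent of every rule in a grid partition."""
    return list(product(*labels_per_input))


print(grid_rule_count([2] * 10))   # 10 inputs, 2 MFs each: 1024 rules
antecedents = grid_antecedents([["small", "large"], ["small", "large"]])
```

Tree and scatter partitions avoid this product by covering only selected regions, at the cost of MFs that are harder to label linguistically.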

Note that Figure 4.13 is based on the assumption that MFs are defined on the 
input variables directly. If MFs are defined on certain transformations of the input 
variables, we could end up in a more flexible partition style. Figure 4.14 is an 
example of the input partition when MFs are defined on linear transformations of 
the input variables. 

Figure 4.14. Input space partition when MFs are defined on linear transformations
of input variables: (a) grid partition; (b) tree partition; (c) scatter partition.

4.5.2 Fuzzy Modeling

By now the reader should have already developed a clear picture of both the structures
and operations of several types of fuzzy inference systems. In general, we
design a fuzzy inference system based on the past known behavior of a target system.
The fuzzy system is then expected to be able to reproduce the behavior of the
target system. For example, if the target system is a human operator in charge of
a chemical reaction process, then the fuzzy inference system becomes a fuzzy logic
controller that can regulate and control the process. Similarly, if the target system
is a medical doctor, then the fuzzy inference system becomes a fuzzy expert system for
medical diagnosis.

Let us now consider how we might construct a fuzzy inference system for a 
specific application. Generally speaking, the standard method for constructing a 
fuzzy inference system, a process usually called fuzzy modeling, has the following 
features: 

• The rule structure of a fuzzy inference system makes it easy to incorporate 
human expertise about the target system directly into the modeling process. 
Namely, fuzzy modeling takes advantage of domain knowledge that might 
not be easily or directly employed in other modeling approaches. 

• When the input-output data of a target system is available, conventional 
system identification techniques can be used for fuzzy modeling. In other 
words, the use of numerical data also plays an important role in fuzzy 
modeling, just as in other mathematical modeling methods. 

In what follows, we shall summarize some general guidelines concerning fuzzy 
modeling. Specific examples of fuzzy modeling for various applications can be found 
in subsequent chapters. 

Conceptually, fuzzy modeling can be pursued in two stages, which are not totally 
disjoint. The first stage is the identification of the surface structure, which 
includes the following tasks: 

1. Select relevant input and output variables. 

2. Choose a specific type of fuzzy inference system. 

3. Determine the number of linguistic terms associated with each input and
output variable. (For a Sugeno model, determine the order of consequent
equations.)

4. Design a collection of fuzzy if-then rules.

Note that to accomplish the preceding tasks, we rely on our own knowledge
(common sense, simple physical laws, and so on) of the target system, information
provided by human experts who are familiar with the target system (which could
be the human experts themselves), or simply trial and error.

After the first stage of fuzzy modeling, we obtain a rule base that can more or 
less describe the behavior of the target system by means of linguistic terms. The 
meaning of these linguistic terms is determined in the second stage, the identifica- 
tion of deep structure, which determines the MFs of each linguistic term (and 
the coefficients of each rule’s output polynomial if a Sugeno fuzzy model is used). 
Specifically, the identification of deep structure includes the following tasks: 

1. Choose an appropriate family of parameterized MFs (see Section 2.4). 

2. Interview human experts familiar with the target systems to determine the 
parameters of the MFs used in the rule base. 

3. Refine the parameters of the MFs using regression and optimization tech- 
niques. 

Tasks 1 and 2 assume the availability of human experts, while task 3 assumes
the availability of a desired input-output data set. Various system identification 
and optimization techniques for parameter identification in task 3 are detailed in 
Chapters 5, 6, and 7. A specific network structure that facilitates task 3 is covered in 
Chapter 12. When a fuzzy inference system is used as a controller for a given plant, 
then the objective in task 3 should be changed to that of searching for parameters 
that will generate the best performance of the plant; this aspect of fuzzy logic 
controller design is explored in Chapters 17 and 18. 


4.6 SUMMARY 

This chapter presents three of the frequently used fuzzy inference systems: the 
Mamdani, Sugeno, and Tsukamoto fuzzy models. We discuss their strengths and 
weaknesses and other related issues, such as input space partitioning and fuzzy 
modeling. 

Fuzzy inference systems are the most important modeling tool based on fuzzy set 
theory. Conventional fuzzy inference systems are typically built by domain experts 
and have been used in automatic control, decision analysis, and expert systems. 
Optimization and adaptive techniques expand the applications of fuzzy inference 
systems to fields such as adaptive control, adaptive signal processing, nonlinear 
regression, and pattern recognition. Chapter 12 discusses adaptive fuzzy inference 
systems; their applications are covered in Chapter 19. 
EXERCISES


1. Derive the fuzzy reasoning mechanism shown in Figure 4.3 by choosing the algebraic
product and max for the T-norm and T-conorm operators, respectively.

2. Give formulas for the three defuzzification strategies [Equations (4.1) to (4.3)]
when we have a finite universe of discourse $X = \{x_1, \ldots, x_n\}$, where
$x_1 < \cdots < x_n$.

3. Use the three defuzzification strategies in Equations (4.1), (4.2), and (4.3) to 
find the representative values of a fuzzy set A defined by 

$\mu_A(x) = \text{trapezoid}(x, 10, 30, 50, 90)$.


4. Repeat the previous exercise, but assume that the universe of discourse X 
contains integers from 0 to 100. 

5. Modify the program mam1.m in Example 4.1 such that you can click and drag a
corner of a trapezoidal MF to change its shape and see the interactive changes 
of the overall input-output curves. 

6. Change the MFs in Example 4.2 to trapezoidal ones and plot the overall input- 
output surface. 

7. Modify Example 4.3 such that only constant terms are retained in the conse- 
quent. Repeat the plots in Figure 4.9. 

8. Modify Example 4.3 by adding a second-order term to the consequent equation 
of each rule. Repeat the plots in Figure 4.9. 

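9. In Example 4.4, use the following different definitions of MFs:

$$
\begin{aligned}
\mu_{\text{small}}(x) &= \mathrm{sig}(x, [-5, 0]), \\
\mu_{\text{large}}(x) &= \mathrm{sig}(x, [5, 0]), \\
\mu_{\text{small}}(y) &= \mathrm{sig}(y, [-2, 0]), \\
\mu_{\text{large}}(y) &= \mathrm{sig}(y, [2, 0]),
\end{aligned}
\tag{4.4}
$$

and repeat the plots in Figure 4.10.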


10. Repeat the previous exercise but use weighted sum instead of weighted average
to derive the final output. Do you get exactly the same input-output surface 
as that in the previous exercise? Why? 
Chapter 5


Least-Squares Methods for 
System Identification 


J.-S. R. Jang 

5.1 SYSTEM IDENTIFICATION: AN INTRODUCTION 

The problem of determining a mathematical model for an unknown system (also 
referred to as the target system) by observing its input-output data pairs is gen- 
erally referred to as system identification. The purposes of system identification 
are multiple: 

• To predict a system’s behavior, as in time series prediction and weather fore- 
casting. 

• To explain the interactions and relationships between inputs and outputs of a
system. For example, a mathematical model can be used to examine whether
the demand indeed varies proportionally to the supply in an economic system.

• To design a controller based on the model of a system, as in aircraft and ship
control. A model of the system is also needed to run computer simulations of
the system under control.

System identification generally involves two top-down steps: 

Structure identification In this step, we need to apply a priori knowledge about
the target system to determine a class of models within which the search for
the most suitable model is to be conducted. Usually this class of models is
denoted by a parameterized function $y = f(\mathbf{u}; \boldsymbol{\theta})$, where $y$ is the
model’s output, $\mathbf{u}$ is the input vector, and $\boldsymbol{\theta}$ is the parameter vector. The
determination of the function $f$ is problem dependent, and the function is based
on the designer’s experience and intuition and the laws of nature governing the
target system.

Parameter identification In the second step, the structure of the model is known
and all we need to do is apply optimization techniques to determine the
parameter vector $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$ such that the resulting model
$\hat{y} = f(\mathbf{u}; \hat{\boldsymbol{\theta}})$ can describe the system appropriately.

Figure 5.1. Block diagram for parameter identification.

If we do not have any a priori knowledge about the target system, then struc- 
ture identification becomes a difficult problem and we have to select the structure 
by trial and error. Fortunately, we know a great deal about the structures of most 
engineering systems and industrial processes; usually it is possible to derive a spe- 
cific class of models — namely, a parameterized function — that can best describe the 
target system. Consequently, the system identification problem is usually reduced 
to that of parameter identification. The problem of parameter identification is thus 
of great importance, and accordingly this chapter is devoted mostly to this class of 
problem. 

Figure 5.1 illustrates a schematic diagram of parameter identification, where an
input $\mathbf{u}_i$ is applied to both the system and the model, and the difference between
the target system’s output $y_i$ and the model’s output $\hat{y}_i$ is used in an appropriate
manner to update a parameter vector $\boldsymbol{\theta}$ to reduce this difference. Note that the
data set composed of $m$ desired input-output pairs $(\mathbf{u}_i; y_i)$, $i = 1, \ldots, m$, is often
called the training data set or sampled data set. In the most general case, $\mathbf{u}_i$
and $\mathbf{y}_i$ represent the desired input and output vectors, respectively.

In general, system identification is not a one-pass process; both structure and
parameter identification must be carried out repeatedly until a satisfactory model
is found, as follows:

1. Specify and parameterize a class of mathematical models representing the 
system to be identified. 

2. Perform parameter identification to choose the parameters that best fit the 
training data set. 

3. Conduct validation tests to see if the model identified responds correctly to
an unseen data set. (This data set is disjoint from the training data set and
is referred to as the test, validating, or checking data set.)

4. Terminate the procedure once the results of the validation test are satisfactory.
Otherwise, another class of models is selected and steps 2 through 4 are
repeated.

Before delving into the core parts of neuro-fuzzy and soft computing, we shall 
introduce a class of standard least-squares methods for linear models in system iden- 
tification. Least-squares methods are powerful and well-developed mathematical 
tools that have been proposed and used in a variety of areas for decades, including 
adaptive control, signal processing, and statistics. Nowadays they still prove to be 
essential and indispensable tools for constructing linear mathematical models. The 
same fundamental concepts can be extended to nonlinear models as well. Thus it 
suffices to say that these linear least-squares methods provide the most basic and 
important mathematical foundation for solving neuro-fuzzy modeling problems in 
subsequent chapters. 

Throughout this chapter, we shall restrict ourselves to the identification of linear 
models and static (or memoryless) systems. By linear models, we mean models that 
are linear in their parameters. Thus a linear model may be nonlinear in its inputs. 
The least-squares methods provide us with mathematical procedures by which a 
linear model can achieve a best fit to experimental data in the sense of least-squared 
error. Nonlinear models that are intrinsically linear can also take advantage of 
the least-squares methods, as explained in Section 5.8. For intrinsically nonlinear 
models, a thorough discussion can be found in Section 6.8 of Chapter 6. 
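
As a small illustration of this distinction (the specific models below are illustrative choices, not taken from the text), the first model is nonlinear in its input $u$ yet linear in its parameters. The second is nonlinear in $\theta_2$, but taking logarithms makes it linear in the transformed parameters $\ln\theta_1$ and $\theta_2$, which is the sense of "intrinsically linear" assumed in this sketch:

```latex
% Linear in the parameters, nonlinear in the input u:
y = \theta_1 + \theta_2 u + \theta_3 u^2 + \theta_4 \sin u

% Nonlinear in \theta_2, but linear after a log transform (assumes y > 0 and \theta_1 > 0):
y = \theta_1 e^{\theta_2 u}
\quad\Longrightarrow\quad
\ln y = \ln\theta_1 + \theta_2 u
```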

By static systems, we mean that the output of the target system depends on its 
current inputs only; it does not depend on the history of inputs. This assumption 
does not impair the generality of our discussion since in the discrete time domain, 
the output of a dynamic system can be treated as a static mapping of its current 
inputs and (several) previous states, assuming they are available. 

In what follows, we begin by briefly reviewing some techniques of matrix ma- 
nipulation and calculus that will be used throughout this chapter. In Section 5.3 
we explain how to find the least-squares estimator of a linear model; the intuitive 
geometric interpretation of the estimator is discussed in Section 5.4. Sections 5.5 
and 5.6 give on-line formulas for the least-squares methods to save computation
time and deal with time-varying target systems. Statistical properties of the
least-squares estimator and its close relationship with the maximum likelihood
estimator are discussed in Section 5.7, which is optional for the first reading.

5.2 BASICS OF MATRIX MANIPULATION AND CALCULUS 

Since formulas for the least-squares estimator and derivative-based optimization 
methods are much more concise in matrix notation, it would be helpful for us to 
review briefly a few matrix manipulation techniques. (However, this chapter is
not intended as an introduction to matrix theory; the reader is expected to have
basic knowledge about linear algebra.) Most of the lemmas introduced here are
straightforward and thus no proofs are given. The reader is encouraged to contrive
simple examples to try out the lemmas introduced.

To minimize ambiguity and enhance readability, our matrix notation follows 
these guidelines: 

• Matrices are represented by bold capital letters, such as $\mathbf{A}$, $\mathbf{B}$, $\mathbf{X}$, and $\mathbf{Y}$.

• Vectors are always assumed to be column vectors unless otherwise specified,
and they are represented either by lowercase boldface letters (such as $\mathbf{a}$, $\mathbf{b}$, $\mathbf{x}$,
and $\mathbf{y}$) or by lowercase letters with explicit vector symbols (such as $\vec{a}$, $\vec{b}$, $\vec{x}$,
and $\vec{y}$).

• Scalars and scalar-valued functions are represented in nonbold letters, such as
$a$, $B$, $x$, and $Y$.

In what follows, we introduce several definitions and lemmas concerning matrix 
manipulation and calculus. 

Lemma 5.1 A property of matrix transpose 

Let $\mathbf{A}$ and $\mathbf{B}$ be compatible matrices. Then

$$(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T,$$

where the superscript $T$ denotes the transpose of a matrix.


□ 


Lemma 5.2 A property of matrix inverse 

Let $\mathbf{A}$ and $\mathbf{B}$ be compatible and nonsingular matrices. Then

$$(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1} \mathbf{A}^{-1},$$

where the superscript $-1$ denotes the inverse of a matrix.


□ 


Definition 5.1 Block form 

A matrix in block form is regarded as a matrix containing blocks of smaller ma- 
trices; these blocks are dictated by the applications in which the matrix arises. 


□ 
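For example, the matrix

$$
\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}
$$

can be written in block form as

$$
\mathbf{A} = \begin{bmatrix} \mathbf{A}_1 & \mathbf{A}_2 \\ \mathbf{A}_3 & \mathbf{A}_4 \end{bmatrix},
$$

where

$$
\mathbf{A}_1 = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}, \quad
\mathbf{A}_2 = \begin{bmatrix} 3 \\ 6 \end{bmatrix}, \quad
\mathbf{A}_3 = \begin{bmatrix} 7 & 8 \end{bmatrix}, \quad \text{and} \quad
\mathbf{A}_4 = \begin{bmatrix} 9 \end{bmatrix}.
$$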


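Note that any $m \times n$ matrix can be viewed either as a row of $n$ column vectors or
as a column of $m$ row vectors. In symbols,

$$
\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_n \end{bmatrix},
$$

where $\mathbf{a}_i$ is the $i$th column of $\mathbf{A}$, or

$$
\mathbf{A} = \begin{bmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_m^T \end{bmatrix},
$$

where $\mathbf{a}_i^T$ is the $i$th row of $\mathbf{A}$. (Thus $\mathbf{a}_i$ may denote either the $i$th column of $\mathbf{A}$ or
the transpose of the $i$th row of $\mathbf{A}$; its meaning should be obvious from the context.)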


Lemma 5.3 Transpose and product of matrices in block form 
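Let $\mathbf{A}$ and $\mathbf{B}$ be two matrices in block form:

$$
\mathbf{A} = \begin{bmatrix} \mathbf{A}_1 & \mathbf{A}_2 \\ \mathbf{A}_3 & \mathbf{A}_4 \end{bmatrix}, \quad
\mathbf{B} = \begin{bmatrix} \mathbf{B}_1 & \mathbf{B}_2 \\ \mathbf{B}_3 & \mathbf{B}_4 \end{bmatrix}.
$$

Then

$$
\mathbf{A}^T = \begin{bmatrix} \mathbf{A}_1^T & \mathbf{A}_3^T \\ \mathbf{A}_2^T & \mathbf{A}_4^T \end{bmatrix}
$$

and

$$
\mathbf{A}\mathbf{B} = \begin{bmatrix}
\mathbf{A}_1\mathbf{B}_1 + \mathbf{A}_2\mathbf{B}_3 & \mathbf{A}_1\mathbf{B}_2 + \mathbf{A}_2\mathbf{B}_4 \\
\mathbf{A}_3\mathbf{B}_1 + \mathbf{A}_4\mathbf{B}_3 & \mathbf{A}_3\mathbf{B}_2 + \mathbf{A}_4\mathbf{B}_4
\end{bmatrix}.
$$

Namely, the matrices can be transposed and multiplied in the same way as if the
blocks were scalars, provided that the individual products are defined.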

Definition 5.2 Gradient of a scalar function. 

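Let $\mathbf{x} = [x_1, \ldots, x_n]^T$ and let $f(\mathbf{x})$ be a scalar function of $\mathbf{x}$. Then the derivative
of $f(\mathbf{x})$ with respect to $\mathbf{x}$, called the gradient vector or gradient of $f(\mathbf{x})$, is a
column vector denoted by

$$
\nabla f(\mathbf{x}) = \begin{bmatrix} \partial f(\mathbf{x})/\partial x_1 \\ \vdots \\ \partial f(\mathbf{x})/\partial x_n \end{bmatrix}.
$$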


□ 


Definition 5.3 Jacobian of a vector function 


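Let $\mathbf{x} = [x_1, \ldots, x_n]^T$ and let $\mathbf{f}(\mathbf{x})$ be a vector function of $\mathbf{x}$, denoted by
$\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})]^T$. Then the derivative of $\mathbf{f}(\mathbf{x})$ with respect to $\mathbf{x}$, called the
Jacobian matrix or Jacobian of $\mathbf{f}(\mathbf{x})$, is an $m \times n$ matrix denoted by

$$
\mathbf{J}_f = \begin{bmatrix} \nabla^T f_1(\mathbf{x}) \\ \vdots \\ \nabla^T f_m(\mathbf{x}) \end{bmatrix}
= \begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix}.
$$

□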


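The Jacobian of a vector function $\mathbf{f}(\mathbf{x})$ is sometimes denoted by $\partial \mathbf{f}(\mathbf{x})/\partial \mathbf{x}^T$.

Definition 5.4 Hessian of a scalar function

Let $\mathbf{x} = [x_1, \ldots, x_n]^T$ and let $f(\mathbf{x})$ be a scalar function of $\mathbf{x}$. Then the second
derivative of $f(\mathbf{x})$ with respect to $\mathbf{x}$, called the Hessian matrix or Hessian of
$f(\mathbf{x})$, is an $n \times n$ matrix denoted by

$$
\mathbf{H}_f = \left[ \frac{\partial^2 f(\mathbf{x})}{\partial x_i \, \partial x_j} \right]_{n \times n}.
$$

In other words, the Hessian can be expressed as

$$
\mathbf{H}_f = \begin{bmatrix}
\dfrac{\partial}{\partial x_1}\!\left(\dfrac{\partial f}{\partial x_1}\right) & \cdots & \dfrac{\partial}{\partial x_n}\!\left(\dfrac{\partial f}{\partial x_1}\right) \\
\vdots & & \vdots \\
\dfrac{\partial}{\partial x_1}\!\left(\dfrac{\partial f}{\partial x_n}\right) & \cdots & \dfrac{\partial}{\partial x_n}\!\left(\dfrac{\partial f}{\partial x_n}\right)
\end{bmatrix}.
$$

The preceding equation states that the Hessian of a scalar function $f(\mathbf{x})$ is
the Jacobian of the gradient of the function. Another commonly used notation for
a Hessian matrix is $\nabla^2 f(\mathbf{x})$. Note that $\mathbf{H}_f$ is symmetric if the second partial
derivatives of $f(\mathbf{x})$ are continuous.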


Example 5.1 Gradient of a linear function 

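Let $\mathbf{c} = [c_1, \ldots, c_n]^T$ and $\mathbf{x} = [x_1, \ldots, x_n]^T$. Then the gradient of a linear scalar
function $f(\mathbf{x}) = \mathbf{c}^T\mathbf{x} = \mathbf{x}^T\mathbf{c}$ is

$$\mathbf{g}_f = \nabla f(\mathbf{x}) = \mathbf{c}.$$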


□ 


Definition 5.5 Quadratic form 

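Let $\mathbf{A} = [a_{ij}]_{n \times n}$ be a square matrix and $\mathbf{x} = [x_1, \ldots, x_n]^T$ be a column vector.
Then the quadratic form in $\mathbf{x}$ with matrix $\mathbf{A}$ is

$$
\mathbf{x}^T \mathbf{A} \mathbf{x} = \sum_{i=1}^{n} a_{ii} x_i^2 + \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} a_{ij} x_i x_j.
$$

□

Note that $\mathbf{A}$ can be assumed symmetric without loss of generality; this is shown
by the following identity:

$$
\mathbf{x}^T \mathbf{A} \mathbf{x} = \mathbf{x}^T \underbrace{\left( \frac{\mathbf{A} + \mathbf{A}^T}{2} \right)}_{\text{symmetric}} \mathbf{x}.
$$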


□ 


Lemma 5.4 Gradient and Hessian of a quadratic form 

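Suppose $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$ is the quadratic form in Definition 5.5. Then the gradient
vector is

$$
\mathbf{g}_f = \begin{cases}
(\mathbf{A} + \mathbf{A}^T)\mathbf{x}, & \text{if } \mathbf{A} \text{ is not symmetric,} \\
2\mathbf{A}\mathbf{x}, & \text{if } \mathbf{A} \text{ is symmetric.}
\end{cases}
$$

The Hessian matrix is

$$
\mathbf{H}_f = \begin{cases}
\mathbf{A} + \mathbf{A}^T, & \text{if } \mathbf{A} \text{ is not symmetric,} \\
2\mathbf{A}, & \text{if } \mathbf{A} \text{ is symmetric.}
\end{cases}
$$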

□ 


Definition 5.6 Positive definite matrices 

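A matrix $\mathbf{A}$ is positive definite, denoted by $\mathbf{A} > 0$, if $\mathbf{x}^T \mathbf{A} \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$.
Alternatively, $\mathbf{A} > 0$ if all its eigenvalues are positive.


□

Conversely, a matrix $\mathbf{A}$ is negative definite if $\mathbf{x}^T \mathbf{A} \mathbf{x} < 0$ for all $\mathbf{x} \neq \mathbf{0}$, or if all
its eigenvalues are negative.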

Lemma 5.5 Optimum of a quadratic function 

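Any quadratic function $f(\mathbf{x})$ can be expressed in matrix notation as follows:

$$f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x} + 2\mathbf{b}^T \mathbf{x} + c, \tag{5.1}$$

where $\mathbf{A}$ is a symmetric matrix. If $\mathbf{A}$ is positive definite, then $f(\mathbf{x})$ achieves its
minimum at $\mathbf{x} = -\mathbf{A}^{-1}\mathbf{b}$. On the other hand, $f(\mathbf{x})$ achieves its maximum at
$\mathbf{x} = -\mathbf{A}^{-1}\mathbf{b}$ if $\mathbf{A}$ is negative definite.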


□ 

The preceding lemma can be verified directly by setting the derivative of /(x) 
to zero. Another method for finding the minimum of /(x) is explored in Exercise 2. 
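
The verification by differentiation can be written out as follows. This is a sketch that relies only on Lemma 5.4 and Example 5.1 and assumes $\mathbf{A}$ is symmetric and nonsingular; the symbol $\mathbf{x}^*$ for the stationary point is introduced here for convenience.

```latex
\begin{aligned}
f(\mathbf{x}) &= \mathbf{x}^T \mathbf{A} \mathbf{x} + 2\mathbf{b}^T \mathbf{x} + c, \\
\nabla f(\mathbf{x}) &= 2\mathbf{A}\mathbf{x} + 2\mathbf{b}, \\
\nabla f(\mathbf{x}^*) = \mathbf{0} \;&\Longrightarrow\; \mathbf{x}^* = -\mathbf{A}^{-1}\mathbf{b}, \\
\mathbf{H}_f &= 2\mathbf{A}.
\end{aligned}
```

Because the Hessian is $2\mathbf{A}$, the stationary point is a minimum when $\mathbf{A}$ is positive definite and a maximum when $\mathbf{A}$ is negative definite.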

The definitions of the gradient, the Jacobian, and the Hessian can lead to a
number of identities. The following are formulas for finding gradient vectors (with
respect to $\mathbf{x}$) of scalar functions. These useful formulas are referred to in the
subsequent chapters.


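$$\nabla(\mathbf{x}^T \mathbf{y}) = \nabla(\mathbf{y}^T \mathbf{x}) = \mathbf{y} \tag{5.2}$$

$$\nabla(\mathbf{x}^T \mathbf{x}) = 2\mathbf{x} \tag{5.3}$$

$$\nabla(\mathbf{x}^T \mathbf{A} \mathbf{y}) = \mathbf{A}\mathbf{y} \tag{5.4}$$

$$\nabla(\mathbf{y}^T \mathbf{A} \mathbf{x}) = \mathbf{A}^T \mathbf{y} \tag{5.5}$$

$$\nabla(\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x} \tag{5.6}$$

$$\nabla[\mathbf{f}^T(\mathbf{x})\mathbf{g}(\mathbf{x})] = \nabla[\mathbf{g}^T(\mathbf{x})\mathbf{f}(\mathbf{x})] = \mathbf{J}_f^T \mathbf{g} + \mathbf{J}_g^T \mathbf{f} \tag{5.7}$$

$$\nabla[\mathbf{g}^T(\mathbf{x})\mathbf{Q}\mathbf{g}(\mathbf{x})] = 2\mathbf{J}_g^T \mathbf{Q} \mathbf{g}(\mathbf{x}) \quad \text{if } \mathbf{Q} \text{ is symmetric} \tag{5.8}$$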
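
Identities of this kind are easy to spot-check numerically. The following minimal sketch compares Equation (5.6) with a central-difference gradient; the matrix, the test point, the step size `h`, and all function names are arbitrary choices made for this illustration.

```python
def quad_form(A, x):
    """Return x^T A x for a square matrix A (list of rows) and a vector x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def analytic_gradient(A, x):
    """Gradient (A + A^T) x, as given by Equation (5.6)."""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]


def numeric_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad


A = [[2.0, 1.0], [-3.0, 4.0]]  # deliberately not symmetric
x = [1.0, 2.0]

g_exact = analytic_gradient(A, x)
g_approx = numeric_gradient(lambda v: quad_form(A, v), x)
max_gap = max(abs(a - b) for a, b in zip(g_exact, g_approx))
print(g_exact, g_approx, max_gap)
```

Using a non-symmetric matrix here exercises the general form $(\mathbf{A} + \mathbf{A}^T)\mathbf{x}$ rather than the symmetric special case $2\mathbf{A}\mathbf{x}$.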

The following matrix inversion formula is useful when we want to find the inverse 
of a matrix of a specific form. 


Lemma 5.6 Matrix inversion formula 

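Let $\mathbf{A}$ and $\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B}$ be nonsingular square matrices. Then

$$(\mathbf{A} + \mathbf{B}\mathbf{C})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1}. \tag{5.9}$$

Proof: By direct substitution, we have

$$
\begin{aligned}
&(\mathbf{A} + \mathbf{B}\mathbf{C})[\mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1}] \\
&\quad = (\mathbf{A} + \mathbf{B}\mathbf{C})\mathbf{A}^{-1} - (\mathbf{A} + \mathbf{B}\mathbf{C})\mathbf{A}^{-1}\mathbf{B}(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1} \\
&\quad = \mathbf{I} + \mathbf{B}\mathbf{C}\mathbf{A}^{-1} - (\mathbf{B} + \mathbf{B}\mathbf{C}\mathbf{A}^{-1}\mathbf{B})(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1} \\
&\quad = \mathbf{I} + \mathbf{B}\mathbf{C}\mathbf{A}^{-1} - \mathbf{B}(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B})(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1} \\
&\quad = \mathbf{I} + \mathbf{B}\mathbf{C}\mathbf{A}^{-1} - \mathbf{B}\mathbf{C}\mathbf{A}^{-1} \\
&\quad = \mathbf{I}.
\end{aligned}
$$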


□ 
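
A quick numerical check of Equation (5.9) can be made for the special case in which $\mathbf{B}$ is a single column and $\mathbf{C}$ is a single row, so that $\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B}$ is a scalar. The sketch below uses hand-picked 2 x 2 values; the helper names `inv2` and `matvec` are hypothetical and exist only for this illustration.

```python
def inv2(M):
    """Inverse of a 2 x 2 matrix given as a list of two rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(M, v):
    """Product of a 2 x 2 matrix and a length-2 vector."""
    return [M[i][0] * v[0] + M[i][1] * v[1] for i in range(2)]


A = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]  # B is the column vector b
c = [3.0, 1.0]  # C is the row vector c^T

# Left-hand side of (5.9): invert A + b c^T directly.
lhs = inv2([[A[i][j] + b[i] * c[j] for j in range(2)] for i in range(2)])

# Right-hand side of (5.9): A^-1 - A^-1 b (1 + c^T A^-1 b)^-1 c^T A^-1.
A_inv = inv2(A)
A_inv_b = matvec(A_inv, b)
cT_A_inv = [c[0] * A_inv[0][j] + c[1] * A_inv[1][j] for j in range(2)]
scale = 1.0 + c[0] * A_inv_b[0] + c[1] * A_inv_b[1]
rhs = [[A_inv[i][j] - A_inv_b[i] * cT_A_inv[j] / scale for j in range(2)]
       for i in range(2)]

max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(max_gap)
```

The point of the formula is visible in the right-hand side: once $\mathbf{A}^{-1}$ is known, the inverse of the updated matrix needs only a scalar division instead of a new matrix inversion.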

Almost all derivative-based optimization techniques (see Chapter 6) employ the 
concept of the Taylor series expansion, which is defined next. 


Definition 5.7 Taylor series expansion 


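Let $f(\mathbf{x})$ be a real-valued differentiable scalar function of a vector $\mathbf{x} = [x_1, \ldots, x_n]^T$.
Then the Taylor series expansion of $f(\cdot)$ at $\mathbf{x}$, with respect to a small deviation
$\mathbf{d} = [d_1, \ldots, d_n]^T$, can be expressed as

$$
f(\mathbf{x} + \mathbf{d}) = f(\mathbf{x}) + \sum_{i=1}^{n} \frac{\partial f(\mathbf{x})}{\partial x_i} d_i
+ \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 f(\mathbf{x})}{\partial x_i \, \partial x_j} d_i d_j + \text{H.O.T.}
\tag{5.10}
$$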


H.O.T. means “higher-order terms”; they are seldom used in practical situations
since they can be neglected if the deviation $\|\mathbf{d}\|$ is sufficiently small. If we use the
gradient $\mathbf{g}$ and the Hessian $\mathbf{H}$ of the function $f(\cdot)$ at $\mathbf{x}$, and omit the higher-order
terms, then the preceding equation can be rewritten as

$$f(\mathbf{x} + \mathbf{d}) \approx f(\mathbf{x}) + \mathbf{g}^T \mathbf{d} + \tfrac{1}{2}\mathbf{d}^T \mathbf{H} \mathbf{d}. \tag{5.11}$$

This states that when the deviation $\mathbf{d}$ from $\mathbf{x}$ is small, then the behavior of function
$f(\cdot)$ near $\mathbf{x}$ is close to a quadratic function in terms of the deviation $\mathbf{d}$.

□
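
To make Equation (5.11) concrete, the sketch below compares first-order and second-order approximations with the exact value of a two-variable function. The function $f(x_1, x_2) = e^{x_1} + x_1 x_2^2$, the expansion point, and the deviation are arbitrary choices for this illustration; the gradient and Hessian are coded by hand.

```python
import math


def f(x1, x2):
    """Example scalar function of two variables (chosen for illustration)."""
    return math.exp(x1) + x1 * x2 ** 2


def gradient(x1, x2):
    """Hand-derived gradient of f."""
    return [math.exp(x1) + x2 ** 2, 2.0 * x1 * x2]


def hessian(x1, x2):
    """Hand-derived Hessian of f."""
    return [[math.exp(x1), 2.0 * x2], [2.0 * x2, 2.0 * x1]]


def taylor(x, d, order):
    """Taylor approximation of f(x + d) of order 1 or 2, as in Equation (5.11)."""
    g = gradient(*x)
    value = f(*x) + sum(gi * di for gi, di in zip(g, d))
    if order >= 2:
        H = hessian(*x)
        value += 0.5 * sum(d[i] * H[i][j] * d[j]
                           for i in range(2) for j in range(2))
    return value


x = [0.0, 1.0]
d = [0.1, -0.1]
exact = f(x[0] + d[0], x[1] + d[1])
first_order = taylor(x, d, 1)
second_order = taylor(x, d, 2)
err1 = abs(exact - first_order)
err2 = abs(exact - second_order)
print(exact, first_order, second_order)
```

For this deviation the second-order error is roughly an order of magnitude smaller than the first-order error, which is the behavior Equation (5.11) leads one to expect for small $\|\mathbf{d}\|$.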

An interesting example for demonstrating Taylor series expansion of a single 
variable function is the MATLAB file taylor.m, which is available via FTP or 
WWW (see page xxiii). After executing this program within MATLAB you can 
click any point in the figure window to display the first-order (a straight line), 
second-order (a parabola), and third-order Taylor series approximations to the
original curve. Moreover, you can click and drag on any of the ten control points 
of the curve to change its shape interactively. (In fact, the curve is the ninth-order 
least-squares polynomial of these ten control points. This is explained in the next 
section.) 

5.3 LEAST-SQUARES ESTIMATOR 

In the general least-squares problem, the output of a linear model $y$ is given by the
linearly parameterized expression

$$y = \theta_1 f_1(\mathbf{u}) + \theta_2 f_2(\mathbf{u}) + \cdots + \theta_n f_n(\mathbf{u}), \tag{5.12}$$

where $\mathbf{u} = [u_1, \ldots, u_p]^T$ is the model’s input vector, $f_1, \ldots, f_n$ are known
functions of $\mathbf{u}$, and $\theta_1, \ldots, \theta_n$ are unknown parameters to be estimated. In statistics,
the task of fitting data using a linear model is referred to as linear regression.
Thus Equation (5.12) is also called the regression function, and the $\theta_i$’s are called
the regression coefficients.

To identify the unknown parameters $\theta_i$, usually we have to perform experiments
to obtain a training data set composed of data pairs $\{(\mathbf{u}_i; y_i),\ i = 1, \ldots, m\}$; they
represent desired input-output pairs of the target system to be modeled. Substituting
each data pair into Equation (5.12) yields a set of $m$ linear equations:

$$
\begin{cases}
f_1(\mathbf{u}_1)\theta_1 + f_2(\mathbf{u}_1)\theta_2 + \cdots + f_n(\mathbf{u}_1)\theta_n = y_1, \\
f_1(\mathbf{u}_2)\theta_1 + f_2(\mathbf{u}_2)\theta_2 + \cdots + f_n(\mathbf{u}_2)\theta_n = y_2, \\
\qquad \vdots \\
f_1(\mathbf{u}_m)\theta_1 + f_2(\mathbf{u}_m)\theta_2 + \cdots + f_n(\mathbf{u}_m)\theta_n = y_m.
\end{cases}
\tag{5.13}
$$

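Using matrix notation, we can rewrite the preceding equations in a concise form:

$$\mathbf{A}\boldsymbol{\theta} = \mathbf{y}, \tag{5.14}$$

where $\mathbf{A}$ is an $m \times n$ matrix (sometimes called the design matrix):

$$
\mathbf{A} = \begin{bmatrix}
f_1(\mathbf{u}_1) & \cdots & f_n(\mathbf{u}_1) \\
\vdots & & \vdots \\
f_1(\mathbf{u}_m) & \cdots & f_n(\mathbf{u}_m)
\end{bmatrix},
$$

$\boldsymbol{\theta}$ is an $n \times 1$ unknown parameter vector:

$$
\boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_n \end{bmatrix},
$$

and $\mathbf{y}$ is an $m \times 1$ output vector:

$$
\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}.
$$

The $i$th row of the joint data matrix $[\mathbf{A} \,\vdots\, \mathbf{y}]$, denoted by $[\mathbf{a}_i^T \,\vdots\, y_i]$, is related to the
$i$th input-output data pair $(\mathbf{u}_i; y_i)$ through

$$\mathbf{a}_i^T = [f_1(\mathbf{u}_i), \ldots, f_n(\mathbf{u}_i)].$$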

Since most of our calculation is based on matrices $\mathbf{A}$ and $\mathbf{y}$, sometimes we loosely
refer to $(\mathbf{a}_i^T; y_i)$ as the $i$th data pair of the training data set.

To identify uniquely the unknown vector $\boldsymbol{\theta}$, it is necessary that $m \geq n$. If $\mathbf{A}$ is
square ($m = n$) and nonsingular, then we can solve for $\boldsymbol{\theta}$ from Equation (5.14) by

$$\boldsymbol{\theta} = \mathbf{A}^{-1}\mathbf{y}. \tag{5.15}$$

However, usually m is greater than n, indicating that we have more data pairs than 
fitting parameters. In this case, an exact solution satisfying all the m equations is 
not always possible, since the data might be contaminated by noise, or the model 
might not be appropriate for describing the target system. Thus Equation (5.14) 
should be modified by incorporating an error vector e to account for random noise 
or modeling error, as follows: 

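$$\mathbf{A}\boldsymbol{\theta} + \mathbf{e} = \mathbf{y}. \tag{5.16}$$

Now, instead of finding the exact solution to Equation (5.14), we want to search for
a $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$ which minimizes the sum of squared error defined by

$$
E(\boldsymbol{\theta}) = \sum_{i=1}^{m} (y_i - \mathbf{a}_i^T \boldsymbol{\theta})^2 = \mathbf{e}^T \mathbf{e} = (\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^T (\mathbf{y} - \mathbf{A}\boldsymbol{\theta}),
\tag{5.17}
$$

where $\mathbf{e} = \mathbf{y} - \mathbf{A}\boldsymbol{\theta}$ is the error vector produced by a specific choice of $\boldsymbol{\theta}$. Note
that $E(\boldsymbol{\theta})$ is in quadratic form and has a unique minimum at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$. The following
theorem states a necessary condition satisfied by the least-squares estimator $\hat{\boldsymbol{\theta}}$.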

Theorem 5.1 Least-squares estimator 
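The squared error in Equation (5.17) is minimized when $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$, called the
least-squares estimator (LSE for short), which satisfies the normal equation

$$\mathbf{A}^T \mathbf{A} \hat{\boldsymbol{\theta}} = \mathbf{A}^T \mathbf{y}. \tag{5.18}$$

If $\mathbf{A}^T \mathbf{A}$ is nonsingular, $\hat{\boldsymbol{\theta}}$ is unique and is given by

$$\hat{\boldsymbol{\theta}} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{y}. \tag{5.19}$$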


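Proof: There are a number of methods available in the literature for finding the
least-squares estimator for Equation (5.14). One straightforward approach is to set
the derivative of $E(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ equal to zero. Noting that
$\boldsymbol{\theta}^T \mathbf{A}^T \mathbf{y} = \mathbf{y}^T \mathbf{A} \boldsymbol{\theta}$ is a scalar, we can expand $E(\boldsymbol{\theta})$:

$$
E(\boldsymbol{\theta}) = (\mathbf{y}^T - \boldsymbol{\theta}^T \mathbf{A}^T)(\mathbf{y} - \mathbf{A}\boldsymbol{\theta})
= \boldsymbol{\theta}^T \mathbf{A}^T \mathbf{A} \boldsymbol{\theta} - 2\mathbf{y}^T \mathbf{A} \boldsymbol{\theta} + \mathbf{y}^T \mathbf{y}.
\tag{5.20}
$$

Then the derivative of $E(\boldsymbol{\theta})$ is

$$\frac{\partial E(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = 2\mathbf{A}^T \mathbf{A} \boldsymbol{\theta} - 2\mathbf{A}^T \mathbf{y}. \tag{5.21}$$

By setting $\partial E(\boldsymbol{\theta})/\partial \boldsymbol{\theta} = \mathbf{0}$ at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$, we obtain the normal equation

$$\mathbf{A}^T \mathbf{A} \hat{\boldsymbol{\theta}} = \mathbf{A}^T \mathbf{y}. \tag{5.22}$$

If $\mathbf{A}^T \mathbf{A}$ is nonsingular, then $\hat{\boldsymbol{\theta}}$ can be solved uniquely:

$$\hat{\boldsymbol{\theta}} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{y}. \tag{5.23}$$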


□ 


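LSE can also be obtained directly from Lemma 5.5 since $E(\boldsymbol{\theta})$ in Equation (5.20)
is a quadratic function of $\boldsymbol{\theta}$. The least-squared error achieved by $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$ can be found
to be

$$
E(\hat{\boldsymbol{\theta}}) = (\mathbf{y} - \mathbf{A}\hat{\boldsymbol{\theta}})^T (\mathbf{y} - \mathbf{A}\hat{\boldsymbol{\theta}})
= \mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{A} (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{y}.
\tag{5.24}
$$

However, if $\mathbf{A}^T \mathbf{A}$ is singular, then the LSE is not unique and we have to employ
the concept of generalized inverse to find $\hat{\boldsymbol{\theta}}$. Readers interested in this aspect of
the problem will find a detailed treatment of it in the literature, such as in Chapter
5 of [10]. Without loss of generality, we shall assume that $\mathbf{A}^T \mathbf{A}$ is nonsingular
throughout this chapter.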

The foregoing derivation is based on the assumption that every element of the 
error vector e has the same weight toward the overall squared error. A further 
generalization is to let each error term be weighted differently. Specifically, let W 
be the desired weighting matrix, which is symmetric and positive definite. Then 
the weighted squared error is 

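$$E_W(\boldsymbol{\theta}) = (\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^T \mathbf{W} (\mathbf{y} - \mathbf{A}\boldsymbol{\theta}). \tag{5.25}$$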
Table 5.1. Training data for the spring example.

Experiment    Force (newtons)    Length of Spring (inches)
    1              1.1                    1.5
    2              1.9                    2.1
    3              3.2                    2.5
    4              4.4                    3.3
    5              5.9                    4.1
    6              7.4                    4.6
    7              9.2                    5.0


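Minimizing $E_W(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ yields the weighted least-squares
estimator $\hat{\boldsymbol{\theta}}_W$:

$$\hat{\boldsymbol{\theta}}_W = (\mathbf{A}^T \mathbf{W} \mathbf{A})^{-1} \mathbf{A}^T \mathbf{W} \mathbf{y}. \tag{5.26}$$

Obviously, $\hat{\boldsymbol{\theta}}_W$ reduces to $\hat{\boldsymbol{\theta}}$ when $\mathbf{W}$ is chosen as an identity matrix.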

Example 5.2 Least-squares estimator 


From Hooke’s law, we know that when a force is applied to a spring constructed of 
uniform material, the change in the length of the spring is proportional to the force 
applied. Therefore, we have the following expression governing the relationship 
between a spring’s length $l$ and an applied force $f$:

$$l = k_0 + k_1 f, \tag{5.27}$$

where $k_0$ represents the length of the spring with no force applied and $k_1$ (the
spring constant) represents the change in length when a unit of force is applied.
To identify $k_0$ and $k_1$ for a particular spring, ideally we can apply two different
forces and observe the corresponding lengths of the spring. Then the values of $k_0$
and $k_1$ can be determined uniquely by solving two simultaneous linear equations in
two unknowns. However, this approach is sensitive to measurement error or noise,
and therefore is not preferred. To identify $k_0$ and $k_1$ accurately, usually we apply
several different forces and record the corresponding lengths of the spring. Here we
suppose that the data pairs obtained are as listed in Table 5.1.

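Substituting each row of Table 5.1 into Equation (5.27) and incorporating an
error vector $\mathbf{e}$, we have

$$
\underbrace{\begin{bmatrix}
1 & 1.1 \\ 1 & 1.9 \\ 1 & 3.2 \\ 1 & 4.4 \\ 1 & 5.9 \\ 1 & 7.4 \\ 1 & 9.2
\end{bmatrix}}_{\mathbf{A}}
\underbrace{\begin{bmatrix} k_0 \\ k_1 \end{bmatrix}}_{\boldsymbol{\theta}}
+
\underbrace{\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \\ e_7 \end{bmatrix}}_{\mathbf{e}}
=
\underbrace{\begin{bmatrix} 1.5 \\ 2.1 \\ 2.5 \\ 3.3 \\ 4.1 \\ 4.6 \\ 5.0 \end{bmatrix}}_{\mathbf{y}}.
$$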


Therefore, the least-squares estimator of $[k_0, k_1]^T$ which minimizes
$\mathbf{e}^T \mathbf{e} = \sum_{i=1}^{7} e_i^2$ is equal to

$$
\begin{bmatrix} \hat{k}_0 \\ \hat{k}_1 \end{bmatrix}
= (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{y}
= \begin{bmatrix} 1.20 \\ 0.44 \end{bmatrix}.
$$

Figure 5.2(a) shows the least-squares line that minimizes the squared error. It is
obvious that as we have more data points, the resulting least-squares estimator is
less susceptible to measurement error or noise.

□
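
The numbers in this example can be reproduced with a few lines of code. The sketch below solves the 2 x 2 normal equation for the straight-line model in closed form, using the data of Table 5.1; the function name `fit_line` and the variable names are choices made for this illustration and are not part of the book's MATLAB file spring.m.

```python
forces = [1.1, 1.9, 3.2, 4.4, 5.9, 7.4, 9.2]   # Table 5.1, force (newtons)
lengths = [1.5, 2.1, 2.5, 3.3, 4.1, 4.6, 5.0]  # Table 5.1, length (inches)


def fit_line(f, l):
    """Least-squares estimate of (k0, k1) in l = k0 + k1 * f.

    With rows [1, f_i] in the design matrix A, the normal equation is
        [[m,      sum f  ],   [k0]   [sum l  ]
         [sum f,  sum f^2]] * [k1] = [sum f*l],
    which is solved here by inverting the 2 x 2 matrix explicitly.
    """
    m = len(f)
    s_f = sum(f)
    s_ff = sum(v * v for v in f)
    s_l = sum(l)
    s_fl = sum(a * b for a, b in zip(f, l))
    det = m * s_ff - s_f * s_f
    k0 = (s_ff * s_l - s_f * s_fl) / det
    k1 = (m * s_fl - s_f * s_l) / det
    return k0, k1


k0, k1 = fit_line(forces, lengths)
print(round(k0, 2), round(k1, 2))
```

Rounded to two decimals, the result agrees with the estimate $[1.20, 0.44]^T$ quoted in the example.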

In the preceding example, we determine the structure of the model [Equa- 
tion (5.27)] according to Hooke’s law. If the current model is not suitable for 
describing the spring’s behavior, then we can increase the model’s degrees of free- 
dom by introducing terms of higher orders: 

$$l = k_0 + k_1 f + k_2 f^2 + \cdots + k_n f^n. \tag{5.28}$$

The same identification procedure can be performed to find the LSE for
$\boldsymbol{\theta} = [k_0, k_1, \ldots, k_n]^T$, which results in a least-squares polynomial that minimizes
the squared error. Figures 5.2(b) through 5.2(d) are least-squares polynomials with
order 2, 3, and 4, respectively.

Note that the squared error never increases as the order of the least-squares
polynomial increases. However, although it fits the training data better, a
polynomial with a higher order does not always reflect the true characteristics of the
system in question. This caveat is demonstrated in Figures 5.2(c) and 5.2(d), where
the spring’s length gets shorter when it is subject to a force of 10 N or more. This
obviously contradicts our empirical knowledge of a spring’s behavior.
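
The effect of the polynomial order on the training error can be explored with the sketch below, which fits polynomials of order 1 through 4 to the data of Table 5.1 by solving the normal equation with Gaussian elimination. This is a plain illustrative implementation with hypothetical helper names (`solve`, `polyfit_lse`, `squared_error`); forming $\mathbf{A}^T\mathbf{A}$ explicitly is adequate for these small orders but becomes numerically fragile for high-order polynomials.

```python
forces = [1.1, 1.9, 3.2, 4.4, 5.9, 7.4, 9.2]
lengths = [1.5, 2.1, 2.5, 3.3, 4.1, 4.6, 5.0]


def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def polyfit_lse(f, l, order):
    """LSE of [k0, ..., k_order] via the normal equation A^T A theta = A^T y."""
    n = order + 1
    A = [[fi ** p for p in range(n)] for fi in f]
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    Aty = [sum(row[i] * li for row, li in zip(A, l)) for i in range(n)]
    return solve(AtA, Aty)


def squared_error(f, l, theta):
    """Sum of squared residuals of the fitted polynomial on the data."""
    return sum((li - sum(t * fi ** p for p, t in enumerate(theta))) ** 2
               for fi, li in zip(f, l))


errors = [squared_error(forces, lengths, polyfit_lse(forces, lengths, k))
          for k in range(1, 5)]
for order, err in enumerate(errors, start=1):
    print(order, err)
```

The printed errors form a non-increasing sequence, which shows why training error alone cannot be used to choose the order.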

Another demonstration of least-squares polynomials is the MATLAB program
taylor.m, where you can click on any of the ten control points to change the shape
of the ninth-order least-squares polynomial. Because a ninth-order polynomial has
ten coefficients, it always fits the ten control points perfectly, but it is not robust:
a small amount of noise in the data set could change the whole curve dramatically
and make it untrustworthy.

An easy way to select a polynomial of suitable order is to apply another
input-output data set, called the validating or test data set, that was not used in
constructing the least-squares polynomial. This test data set can verify the
generalization capability of the resulting models and thus provide an unbiased index
for selecting the best model. Many other approaches to determining a model’s order
have been proposed in the literature. For a thorough treatment, see, for example,
Chapter 6 of [2].

Figure 5.2. Fitting data through least-squares polynomials: (a) first-order,
(b) second-order, (c) third-order, and (d) fourth-order polynomial. (MATLAB file:
spring.m)

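If the target system has $q$ outputs, expressed as $\mathbf{y} = [y_1, \ldots, y_q]^T$ with $q > 1$,
then we have a set of linear equations in matrix form:

$$\mathbf{A}\boldsymbol{\Theta} + \mathbf{E} = \mathbf{Y},$$

where $\mathbf{A}$ is an $m \times n$ matrix, as introduced previously:

$$
\mathbf{A} = \begin{bmatrix}
f_1(\mathbf{u}_1) & \cdots & f_n(\mathbf{u}_1) \\
\vdots & & \vdots \\
f_1(\mathbf{u}_m) & \cdots & f_n(\mathbf{u}_m)
\end{bmatrix},
$$

$\boldsymbol{\Theta}$ is an $n \times q$ unknown parameter matrix:

$$
\boldsymbol{\Theta} = \begin{bmatrix}
\theta_{11} & \cdots & \theta_{1q} \\
\vdots & & \vdots \\
\theta_{n1} & \cdots & \theta_{nq}
\end{bmatrix},
$$

and

$$
\mathbf{Y} = \begin{bmatrix}
y_{11} & \cdots & y_{1q} \\
\vdots & & \vdots \\
y_{m1} & \cdots & y_{mq}
\end{bmatrix}
$$

is an $m \times q$ output matrix with $y_{ij}$ denoting the $j$th output value in the $i$th data
pair.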

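Now we want to minimize a similar squared error

$$
E(\boldsymbol{\Theta}) = \sum_{i=1}^{m} \sum_{j=1}^{q} e_{ij}^2 = \sum_{j=1}^{q} \left( \sum_{i=1}^{m} e_{ij}^2 \right).
$$

Note that $\sum_{i=1}^{m} e_{ij}^2$ is the squared length of the $j$th column of matrix $\mathbf{E}$, which
depends on the $j$th column of $\boldsymbol{\Theta}$ only. Hence

$$
\min_{\boldsymbol{\Theta}} \sum_{j=1}^{q} \left( \sum_{i=1}^{m} e_{ij}^2 \right)
= \sum_{j=1}^{q} \min_{\boldsymbol{\theta}_j} \sum_{i=1}^{m} e_{ij}^2,
$$

where $\boldsymbol{\theta}_j$ is the $j$th column of $\boldsymbol{\Theta}$. In other words, $\boldsymbol{\theta}_j = \hat{\boldsymbol{\theta}}_j$, which minimizes
$\sum_{i=1}^{m} e_{ij}^2$, is the least-squares estimator to the subproblem

$$\mathbf{A}\boldsymbol{\theta}_j + \mathbf{e}_j = \mathbf{y}_j,$$

where $\mathbf{e}_j$ and $\mathbf{y}_j$ are the $j$th columns of $\mathbf{E}$ and $\mathbf{Y}$, respectively. As a result,

$$\hat{\boldsymbol{\theta}}_j = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{y}_j,$$

and

$$\hat{\boldsymbol{\Theta}} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{Y}.$$

This implies that the optimality occurs when each column of $\mathbf{A}\hat{\boldsymbol{\Theta}}$ is equal to the
projection of the corresponding column of $\mathbf{Y}$ onto the space spanned by the columns
of $\mathbf{A}$ (see the following section on the geometric interpretation of the LSE), so the
calculation of each $\hat{\boldsymbol{\theta}}_j$ can be performed independently.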


5.4 GEOMETRIC INTERPRETATION OF LSE 

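Now we shall discuss the geometric interpretation of the least-squares estimator.
Let $\mathbf{A}$ be expressed as a row of $n$ column vectors of size $m \times 1$, as follows:

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_n \end{bmatrix}.$$

Then we have

$$
\mathbf{A}\boldsymbol{\theta} = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_n \end{bmatrix}
\begin{bmatrix} \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}
= \theta_1 \mathbf{a}_1 + \cdots + \theta_n \mathbf{a}_n.
\tag{5.29}
$$

In other words, $\mathbf{A}\boldsymbol{\theta}$ is a linear combination of the basis vectors $\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$ in an
$m$-dimensional space. For $\mathbf{A}\boldsymbol{\theta}$ to approximate $\mathbf{y}$ in a least-squares sense, clearly $\mathbf{A}\boldsymbol{\theta}$
should be equal to the projection of $\mathbf{y}$ onto the space spanned by $\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$.

Figure 5.3. Geometrical interpretation of the least-squares estimator.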

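To illustrate this concept, Figure 5.3 shows the situation where $n = 2$ and $m = 3$.
Here we have

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 \end{bmatrix}$$

and $\mathbf{A}\boldsymbol{\theta} = \theta_1 \mathbf{a}_1 + \theta_2 \mathbf{a}_2$. Note that $\mathbf{A}\boldsymbol{\theta}$ always stays on the plane spanned by $\mathbf{a}_1$
and $\mathbf{a}_2$. Thus for $\mathbf{e} = \mathbf{y} - \mathbf{A}\boldsymbol{\theta}$ to achieve a minimum length, $\mathbf{A}\boldsymbol{\theta}$ must be equal to
the projection of $\mathbf{y}$ onto the plane spanned by $\mathbf{a}_1$ and $\mathbf{a}_2$. This optimality occurs
when $\mathbf{e} = \mathbf{y} - \mathbf{A}\hat{\boldsymbol{\theta}}$ is orthogonal to both $\mathbf{a}_1$ and $\mathbf{a}_2$; this is called the principle of
orthogonality. In symbols,

$$\mathbf{a}_1^T (\mathbf{y} - \mathbf{A}\hat{\boldsymbol{\theta}}) = 0, \tag{5.30}$$

$$\mathbf{a}_2^T (\mathbf{y} - \mathbf{A}\hat{\boldsymbol{\theta}}) = 0. \tag{5.31}$$

These two conditions reduce to

$$\mathbf{A}^T (\mathbf{y} - \mathbf{A}\hat{\boldsymbol{\theta}}) = \mathbf{0}, \tag{5.32}$$

which is exactly the normal equation [see Equation (5.22), Theorem 5.1] derived by
direct differentiation.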

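If we use $\mathrm{proj}_{\mathbf{A}}(\mathbf{y})$ to denote the projection of vector $\mathbf{y}$ onto the space spanned
by the columns of $\mathbf{A}$, then

$$\mathrm{proj}_{\mathbf{A}}(\mathbf{y}) = \mathbf{A}\hat{\boldsymbol{\theta}} = \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{y}. \tag{5.33}$$

$\mathbf{H} = \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ can be viewed as a projection operator; the complementary
orthogonal operator is $\mathbf{M} = \mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$. Accordingly, we have $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$
and $\mathbf{e} = \mathbf{M}\mathbf{y}$. ($\mathbf{H}$ is also called the hat operator since it puts a hat on $\mathbf{y}$.)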

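The geometric interpretation shown in Figure 5.3 is intuitive, yet it is indispensable
in confirming a number of important properties of the least-squares estimator
$\hat{\boldsymbol{\theta}}$, some of which are as follows:

• The minimal error measure is equal to the inner product of $\mathbf{y}$ and $\mathbf{e}$ at
optimality. That is,

$$E(\hat{\boldsymbol{\theta}}) = \mathbf{y}^T \mathbf{e}. \tag{5.34}$$

• The optimal approximation of $\mathbf{y}$, denoted by $\hat{\mathbf{y}}$ ($= \mathbf{A}\hat{\boldsymbol{\theta}}$), is orthogonal to the
minimal-length error vector $\mathbf{e}$. That is,

$$\hat{\mathbf{y}}^T \mathbf{e} = 0. \tag{5.35}$$

• The minimal-length error vector $\mathbf{e}$ is equal to the result of applying the
orthogonal operator $\mathbf{M}$ on $\mathbf{e}$. That is,

$$\mathbf{e} = \mathbf{M}\mathbf{e} = [\mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T](\mathbf{y} - \mathbf{A}\hat{\boldsymbol{\theta}}). \tag{5.36}$$

• Each column of $\mathbf{A}$ is invariant under the projection operator $\mathbf{H}$. That is,

$$\mathbf{H}\mathbf{A} = \mathbf{A}, \tag{5.37}$$

or equivalently,

$$\mathbf{M}\mathbf{A} = \mathbf{0}. \tag{5.38}$$

• The projection operator $\mathbf{H} = \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ is idempotent. That is, for any
integer $k$ greater than one, $\mathbf{H}$ satisfies

$$\mathbf{H}^k = \mathbf{H}. \tag{5.39}$$

In other words, for any vector $\mathbf{y}$, we have

$$\underbrace{\mathbf{H}\mathbf{H} \cdots \mathbf{H}}_{k \text{ times}} \mathbf{y} = \mathbf{H}\mathbf{y},$$

which means that once a vector has been obtained as the projection onto the
subspace spanned by the columns of $\mathbf{A}$, it cannot be changed by any further
application of $\mathbf{H}$.

• The orthogonal operator $\mathbf{M} = \mathbf{I} - \mathbf{H} = \mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ is also idempotent,
for a reason similar to that given previously.

Mathematical derivation of these properties is left as Exercise 7.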
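
Before attempting the derivations, it can be reassuring to check several of these properties numerically. The sketch below builds $\mathbf{H}$ for a small case with $m = 3$ and $n = 2$ (the dimensions of Figure 5.3); the entries of `A` and `y` are arbitrary values chosen for this illustration, and the helper functions are hypothetical. A numerical check of this kind does not replace the derivation asked for in the exercise.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def max_abs_diff(X, Y):
    return max(abs(a - b) for rx, ry in zip(X, Y) for a, b in zip(rx, ry))


A = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # m = 3, n = 2
y = [[1.0], [2.0], [4.0]]                 # column vector

At = transpose(A)
H = matmul(matmul(A, inv2(matmul(At, A))), At)  # projection operator
y_hat = matmul(H, y)                            # projection of y
e = [[yi[0] - yh[0]] for yi, yh in zip(y, y_hat)]

idempotent_gap = max_abs_diff(matmul(H, H), H)   # (5.39) with k = 2
invariant_gap = max_abs_diff(matmul(H, A), A)    # (5.37)
y_hat_dot_e = sum(a[0] * b[0] for a, b in zip(y_hat, e))  # (5.35)
E_min = sum(v[0] ** 2 for v in e)
y_dot_e = sum(a[0] * b[0] for a, b in zip(y, e))          # (5.34)
print(idempotent_gap, invariant_gap, y_hat_dot_e, E_min, y_dot_e)
```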
5.5 RECURSIVE LEAST-SQUARES ESTIMATOR

The least-squares estimator derived in the previous section can be expressed as

$$\boldsymbol{\theta}_k = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{y}, \tag{5.40}$$

where we have left out the hat (^) for simplicity. Here we assume the row dimensions
of $\mathbf{A}$ and $\mathbf{y}$ are $k$; thus a subscript $k$ is added in the preceding equation to denote
the number of data pairs used for the estimator $\boldsymbol{\theta}$. The index $k$ can also be regarded as
a measure of time if the data pairs become available in sequential order. Suppose
that a new data pair $(\mathbf{a}^T; y)$ becomes available as the $(k + 1)$th entry in the data
set. Then instead of using all the $k + 1$ available data pairs to recalculate the
least-squares estimator $\boldsymbol{\theta}_{k+1}$, we want to find a way of taking advantage of the $\boldsymbol{\theta}_k$
already available to obtain $\boldsymbol{\theta}_{k+1}$ with a minimum of effort. In other words, our
task is to find a way of using the new data pair $(\mathbf{a}^T; y)$ to update $\boldsymbol{\theta}_k$ appropriately
to find $\boldsymbol{\theta}_{k+1}$. This problem is called recursive least-squares identification and
has been fully addressed in the literature [3, 7, 9].

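Obviously, $\boldsymbol{\theta}_{k+1}$ can be expressed as

$$
\boldsymbol{\theta}_{k+1} = \left( \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix}^T \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix} \right)^{-1}
\begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix}^T \begin{bmatrix} \mathbf{y} \\ y \end{bmatrix}.
\tag{5.41}
$$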


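To simplify the notation, we introduce two $n \times n$ matrices $\mathbf{P}_k$ and $\mathbf{P}_{k+1}$ defined by

$$\mathbf{P}_k = (\mathbf{A}^T \mathbf{A})^{-1}, \tag{5.42}$$

$$
\mathbf{P}_{k+1} = \left( \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix}^T \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix} \right)^{-1}
= \left( \begin{bmatrix} \mathbf{A}^T & \mathbf{a} \end{bmatrix} \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix} \right)^{-1}
= (\mathbf{A}^T \mathbf{A} + \mathbf{a}\mathbf{a}^T)^{-1}.
\tag{5.43}
$$

These two matrices are related by

$$\mathbf{P}_k^{-1} = \mathbf{P}_{k+1}^{-1} - \mathbf{a}\mathbf{a}^T. \tag{5.44}$$

Using $\mathbf{P}_k$ and $\mathbf{P}_{k+1}$, we have

$$
\begin{cases}
\boldsymbol{\theta}_k = \mathbf{P}_k \mathbf{A}^T \mathbf{y}, \\
\boldsymbol{\theta}_{k+1} = \mathbf{P}_{k+1} (\mathbf{A}^T \mathbf{y} + \mathbf{a} y).
\end{cases}
\tag{5.45}
$$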

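To express $\boldsymbol{\theta}_{k+1}$ in terms of $\boldsymbol{\theta}_k$, we have to eliminate $\mathbf{A}^T \mathbf{y}$ in Equation (5.45). From
the first equation in (5.45), we have

$$\mathbf{A}^T \mathbf{y} = \mathbf{P}_k^{-1} \boldsymbol{\theta}_k. \tag{5.46}$$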
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By plugging this expression into the second equation in (5.45) and applying Equa- 
tion (5.44), we have 

Ok+i = Pjfc + i(P* 1 d n + ay) 

= P fc+1 [(PjjJ, - aa T )0fc + ay] (5.47) 

= 0 \ fc + Pft+ia(y - aT Ok). 


Thus $\boldsymbol{\theta}_{k+1}$ can be expressed as a function of the old estimate $\boldsymbol{\theta}_k$ and the new
data pair $(\mathbf{a}^T; y)$. It is interesting to note that Equation (5.47) has an intuitive
interpretation: The new estimator $\boldsymbol{\theta}_{k+1}$ is equal to the old estimator $\boldsymbol{\theta}_k$ plus a
correcting term based on the new data $(\mathbf{a}^T; y)$; this correcting term is equal to an
adaptation gain vector $\mathbf{P}_{k+1}\mathbf{a}$ multiplied by the prediction error produced by
the old estimator, that is, $y - \mathbf{a}^T\boldsymbol{\theta}_k$.

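We are not done yet, however. Calculating $\mathbf{P}_{k+1}$ by Equation (5.43) involves
the inversion of an $n \times n$ matrix. This is computationally expensive and requires
us to find an incremental formula for $\mathbf{P}_{k+1}$. From Equation (5.44), we have

$$\mathbf{P}_{k+1} = (\mathbf{P}_k^{-1} + \mathbf{a}\mathbf{a}^T)^{-1}. \tag{5.48}$$

Applying the matrix inversion formula in Lemma 5.6 with $\mathbf{A} = \mathbf{P}_k^{-1}$, $\mathbf{B} = \mathbf{a}$, and
$\mathbf{C} = \mathbf{a}^T$, we obtain the following incremental formula for $\mathbf{P}_{k+1}$:

$$\begin{aligned} \mathbf{P}_{k+1} &= \mathbf{P}_k - \mathbf{P}_k\mathbf{a}(\mathbf{I} + \mathbf{a}^T\mathbf{P}_k\mathbf{a})^{-1}\mathbf{a}^T\mathbf{P}_k \\ &= \mathbf{P}_k - \frac{\mathbf{P}_k\mathbf{a}\mathbf{a}^T\mathbf{P}_k}{1 + \mathbf{a}^T\mathbf{P}_k\mathbf{a}}. \end{aligned} \tag{5.49}$$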


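In summary, the recursive least-squares estimator for the problem of $\mathbf{A}\boldsymbol{\theta} = \mathbf{y}$,
where the $k$th ($1 \le k \le m$) row of $[\mathbf{A} \,\vdots\, \mathbf{y}]$, denoted by $[\mathbf{a}_k^T \,\vdots\, y_k]$, is sequentially
obtained, can be calculated as follows:

$$\begin{cases} \mathbf{P}_{k+1} = \mathbf{P}_k - \dfrac{\mathbf{P}_k\mathbf{a}_{k+1}\mathbf{a}_{k+1}^T\mathbf{P}_k}{1 + \mathbf{a}_{k+1}^T\mathbf{P}_k\mathbf{a}_{k+1}}, \\[2ex] \boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \mathbf{P}_{k+1}\mathbf{a}_{k+1}(y_{k+1} - \mathbf{a}_{k+1}^T\boldsymbol{\theta}_k), \end{cases} \tag{5.50}$$

where $k$ ranges from 0 to $m - 1$ and the overall LSE $\boldsymbol{\theta}$ is equal to $\boldsymbol{\theta}_m$, the estimator
using all $m$ data pairs.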
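
The update in Equation (5.50) can be sketched in a few lines of code. The following is a minimal illustration, not a reference implementation: the function name `rls_update`, the straight-line data set, and the starting values $\boldsymbol{\theta}_0 = \mathbf{0}$ and $\mathbf{P}_0 = 10^6\,\mathbf{I}$ are all assumptions made for the example, and plain Python lists are used only to keep the sketch free of external dependencies.

```python
def rls_update(theta, P, a, y):
    """Apply Equation (5.50) once for a new data pair (a; y) with scalar y."""
    n = len(a)
    # P_k a (P_k is symmetric, so a^T P_k is the same vector transposed).
    Pa = [sum(P[i][j] * a[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(a[i] * Pa[i] for i in range(n))  # 1 + a^T P_k a
    # P_{k+1} = P_k - P_k a a^T P_k / (1 + a^T P_k a)
    P_next = [[P[i][j] - Pa[i] * Pa[j] / denom for j in range(n)] for i in range(n)]
    # Adaptation gain P_{k+1} a and prediction error of the old estimate.
    gain = [sum(P_next[i][j] * a[j] for j in range(n)) for i in range(n)]
    error = y - sum(a[i] * theta[i] for i in range(n))
    theta_next = [theta[i] + gain[i] * error for i in range(n)]
    return theta_next, P_next


# Hypothetical data generated exactly by y = 2 + 3x, so each row is a^T = [1, x].
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 3.0 * x for x in xs]

# Assumed starting values: theta_0 = 0 and P_0 = alpha * I with a large alpha.
alpha = 1e6
theta = [0.0, 0.0]
P = [[alpha, 0.0], [0.0, alpha]]
for x, y in zip(xs, ys):
    theta, P = rls_update(theta, P, [1.0, x], y)

print(theta)  # close to [2.0, 3.0]
```

Each call costs on the order of $n^2$ operations and never inverts a matrix, which is the point of the incremental formula (5.49).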

To start the algorithm in Equation (5.50), we need to select the initial values of
$\boldsymbol{\theta}_0$ and $\mathbf{P}_0$. One way to avoid determining these initial values is to collect the first
$n$ data points and solve $\boldsymbol{\theta}_n$ and $\mathbf{P}_n$ directly from

$$\begin{cases} \mathbf{P}_n = (\mathbf{A}_n^T\mathbf{A}_n)^{-1}, \\ \boldsymbol{\theta}_n = \mathbf{P}_n\mathbf{A}_n^T\mathbf{y}_n, \end{cases}$$

where $[\mathbf{A}_n \,\vdots\, \mathbf{y}_n]$ is the data matrix composed of the first $n$ data pairs. We can then
start iterating the algorithm from the $(n+1)$th data point. However, sometimes it
is more convenient to use the recursive formulas in Equation (5.50) throughout the
identification process. To do so, notice that

$$\mathbf{P}_k = (\mathbf{P}_0^{-1} + \mathbf{A}_k^T\mathbf{A}_k)^{-1}, \tag{5.51}$$

and the corresponding $\boldsymbol{\theta}_k$ is (see Exercise 10)

$$\boldsymbol{\theta}_k = \mathbf{P}_k(\mathbf{A}_k^T\mathbf{y}_k + \mathbf{P}_0^{-1}\boldsymbol{\theta}_0), \tag{5.52}$$

where $[\mathbf{A}_k \,\vdots\, \mathbf{y}_k]$ is the data matrix composed of $k$ data pairs. By choosing

$$\mathbf{P}_0 = \alpha\mathbf{I},$$

we have

$$\lim_{\alpha\to\infty} \mathbf{P}_0^{-1} = \lim_{\alpha\to\infty} \frac{1}{\alpha}\mathbf{I} = \mathbf{0}.$$

Therefore, by setting $\alpha$ equal to a large number, we can force Equations (5.51)
and (5.52) arbitrarily close to Equation (5.50), regardless of $\boldsymbol{\theta}_0$. In practice, $\boldsymbol{\theta}_0$ is
usually a zero matrix for convenience.
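
The influence of $\alpha$ is easy to see in the one-parameter case, where every quantity in Equations (5.50) through (5.52) is a scalar. The sketch below uses hypothetical data and a deliberately poor initial guess $\theta_0 = 5$; the function name `rls_scalar` is likewise an assumption of this example. It compares the recursive estimate with the closed form (5.52) and with the ordinary batch LSE.

```python
def rls_scalar(a_values, y_values, theta0, alpha):
    """Run Equation (5.50) for a one-parameter model y = theta * a, with P_0 = alpha."""
    theta, P = theta0, alpha
    for a, y in zip(a_values, y_values):
        P = P - (P * a) * (a * P) / (1.0 + a * P * a)
        theta = theta + P * a * (y - a * theta)
    return theta


# Hypothetical data pairs and a deliberately poor initial guess theta_0.
a_values = [1.0, 2.0, 3.0]
y_values = [2.1, 3.9, 6.2]
theta0 = 5.0

saa = sum(a * a for a in a_values)                     # A^T A (a scalar here)
say = sum(a * y for a, y in zip(a_values, y_values))   # A^T y
batch = say / saa                                      # ordinary LSE

gaps = []    # distance of the recursive estimate from the batch LSE
pairs = []   # (recursive estimate, closed form from Equations (5.51) and (5.52))
for alpha in (1.0, 1e2, 1e6):
    recursive = rls_scalar(a_values, y_values, theta0, alpha)
    closed_form = (say + theta0 / alpha) / (1.0 / alpha + saa)
    gaps.append(abs(recursive - batch))
    pairs.append((recursive, closed_form))
    print(alpha, recursive, closed_form, batch)
```

As $\alpha$ grows, the gap to the batch estimate shrinks, and the poor choice of $\theta_0$ stops mattering.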

Remarks 


• The matrix $\mathbf{P}_k$ is proportional to the covariance of the estimators. (See the
Gauss-Markov theorem in Section 5.7 for more details.)

• The least-squares estimator can be interpreted as a Kalman filter [4, 6] for
the process

$$\begin{cases} \boldsymbol{\theta}(k+1) = \boldsymbol{\theta}(k), \\ y(k) = \mathbf{a}^T(k)\boldsymbol{\theta}(k) + e(k), \end{cases}$$

where $k$ is a time index, $e(k)$ is random noise, $\boldsymbol{\theta}(k)$ is the state to be estimated,
and $y(k)$ is the observed output. [Note that $\mathbf{a}^T(k)$ and $y(k)$ are equal to $\mathbf{a}_k^T$
and $y_k$, respectively, in Equation (5.50).]

• When extra parameters are introduced for better performance, $\boldsymbol{\theta}$ will have
more components and there will be additional columns in matrix $\mathbf{A}$. It is
possible to reduce the complexity of the calculation by employing recursion
in the number of parameters. Interested readers are referred to Section 3.6
of [5].

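• The recursive LSE for systems with multiple outputs can be derived almost
identically:

$$\begin{cases} \mathbf{P}_{k+1} = \mathbf{P}_k - \dfrac{\mathbf{P}_k\mathbf{a}_{k+1}\mathbf{a}_{k+1}^T\mathbf{P}_k}{1 + \mathbf{a}_{k+1}^T\mathbf{P}_k\mathbf{a}_{k+1}}, \\[2ex] \boldsymbol{\Theta}_{k+1} = \boldsymbol{\Theta}_k + \mathbf{P}_{k+1}\mathbf{a}_{k+1}(\mathbf{y}_{k+1}^T - \mathbf{a}_{k+1}^T\boldsymbol{\Theta}_k), \end{cases} \tag{5.53}$$

where $(\mathbf{a}_k^T; \mathbf{y}_k^T)$ is the $k$th data pair.

5.6 RECURSIVE LSE FOR TIME-VARYING SYSTEMS*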

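From Equation (5.42), we have $\mathbf{P}_k = (\mathbf{A}^T\mathbf{A})^{-1}$, where $k$ is the number of data pairs
encountered so far, and it is also the row dimension of $\mathbf{A}$. If $k$ is greater than the
number of fitting parameters and the data pairs contain rich enough information,
then $\mathbf{A}^T\mathbf{A}$ is usually positive definite, and, as $k$ goes to infinity, $\frac{1}{k}\mathbf{A}^T\mathbf{A}$ approaches
a nonsingular constant matrix. Therefore, we have

$$\lim_{k\to\infty} \mathbf{P}_k = \lim_{k\to\infty} \frac{1}{k}\left(\frac{1}{k}\mathbf{A}^T\mathbf{A}\right)^{-1} = \mathbf{0},$$

which indicates that the adaptation gain $\mathbf{P}_{k+1}\mathbf{a}_{k+1}$ in Equation (5.50) decreases
at each iteration. This is a direct consequence of the squared error defined in
Equation (5.17), which treats each error component equally. For time-invariant
systems, this works well, since the decreasing adaptation gain means we are getting
closer to the optimal point in the parameter space. However, for time-varying
systems, this is not appropriate, since the decreasing adaptation gain cannot track
the changing optimal parameters. One simple way to resolve this problem is to reset
the matrix $\mathbf{P}_k$ to $\mathbf{P}_0$ occasionally, since the LSE converges rapidly to the current
optimal parameters. Of course, the obvious time to reset $\mathbf{P}_k$ is when we suspect
that a significant parameter change has occurred.

Another way to deal with time-varying systems is to introduce a forgetting
factor $\lambda$ that places heavier emphasis on more recent data:

$$E(\boldsymbol{\theta}) = \sum_{i=1}^{m} \lambda^{m-i}(y_i - \mathbf{a}_i^T\boldsymbol{\theta})^2 = (\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^T\mathbf{W}(\mathbf{y} - \mathbf{A}\boldsymbol{\theta}), \tag{5.54}$$

where $\mathbf{W}$ is a diagonal matrix:

$$\mathbf{W} = \begin{bmatrix} \lambda^{m-1} & 0 & \cdots & 0 \\ 0 & \lambda^{m-2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$

and $0 < \lambda \le 1$. From Equation (5.26), the corresponding LSE that minimizes the
preceding weighted error measure is defined by

$$\hat{\boldsymbol{\theta}} = (\mathbf{A}^T\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^T\mathbf{W}\mathbf{y}. \tag{5.55}$$

To derive the formulas for a recursive LSE with a forgetting factor, again we
define

$$\boldsymbol{\theta}_k = (\mathbf{A}^T\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^T\mathbf{W}\mathbf{y},$$

where $[\mathbf{A} \,\vdots\, \mathbf{y}]$ contains $k$ data pairs and $\boldsymbol{\theta}_k$ is the LSE using these $k$ data pairs.
Then we have

$$\begin{aligned} \boldsymbol{\theta}_{k+1} &= \left( \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix}^T \begin{bmatrix} \lambda\mathbf{W} & \mathbf{0} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix}^T \begin{bmatrix} \lambda\mathbf{W} & \mathbf{0} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{y} \\ y \end{bmatrix} \\ &= (\lambda\mathbf{A}^T\mathbf{W}\mathbf{A} + \mathbf{a}\mathbf{a}^T)^{-1}(\lambda\mathbf{A}^T\mathbf{W}\mathbf{y} + \mathbf{a}y), \end{aligned}$$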


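where $(\mathbf{a}^T; y)$ is the $(k+1)$th data pair, which has just become available. To
simplify the notation, we again introduce $\mathbf{P}_k$ and $\mathbf{P}_{k+1}$, which are defined slightly
differently from before:

$$\mathbf{P}_k = (\mathbf{A}^T\mathbf{W}\mathbf{A})^{-1},$$

$$\mathbf{P}_{k+1} = \left( \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix}^T \begin{bmatrix} \lambda\mathbf{W} & \mathbf{0} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{A} \\ \mathbf{a}^T \end{bmatrix} \right)^{-1} = (\lambda\mathbf{A}^T\mathbf{W}\mathbf{A} + \mathbf{a}\mathbf{a}^T)^{-1},$$

where $\mathbf{P}_k$ and $\mathbf{P}_{k+1}$ are related through

$$\lambda\mathbf{P}_k^{-1} = \mathbf{P}_{k+1}^{-1} - \mathbf{a}\mathbf{a}^T. \tag{5.56}$$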


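Using $\mathbf{P}_k$ and $\mathbf{P}_{k+1}$, we can rewrite $\boldsymbol{\theta}_k$ and $\boldsymbol{\theta}_{k+1}$ as follows:

$$\boldsymbol{\theta}_k = \mathbf{P}_k\mathbf{A}^T\mathbf{W}\mathbf{y}, \tag{5.57}$$

and

$$\boldsymbol{\theta}_{k+1} = \mathbf{P}_{k+1}(\lambda\mathbf{A}^T\mathbf{W}\mathbf{y} + \mathbf{a}y). \tag{5.58}$$

If we eliminate $\mathbf{A}^T\mathbf{W}\mathbf{y}$ and $\mathbf{P}_k$ from the preceding three equations, we have the
familiar recursive formula for $\boldsymbol{\theta}_{k+1}$:

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \mathbf{P}_{k+1}\mathbf{a}(y - \mathbf{a}^T\boldsymbol{\theta}_k), \tag{5.59}$$

where $\mathbf{P}_{k+1}$ can be expanded from Equation (5.56) using the matrix inversion
formula in Lemma 5.6:

$$\mathbf{P}_{k+1} = \frac{1}{\lambda}\left[ \mathbf{P}_k - \frac{\mathbf{P}_k\mathbf{a}\mathbf{a}^T\mathbf{P}_k}{\lambda + \mathbf{a}^T\mathbf{P}_k\mathbf{a}} \right]. \tag{5.60}$$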


It is obvious that Equation (5.60) reduces to Equation (5.49) if $\lambda = 1$. If $\lambda$ is
small, then recent data are weighted more and the algorithm is more capable of
tracking time-varying parameters. However, at the same time the estimators may
also fluctuate to reflect noise and disturbance. Therefore, the value of $\lambda$ is often
task dependent and has to be determined experimentally.
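
The effect of the forgetting factor can be illustrated with a scalar version of Equations (5.59) and (5.60). This is only a sketch under stated assumptions: the data are hypothetical and noise-free, the true parameter jumps from 1 to 3 halfway through the record, and the name `rls_forgetting_scalar` as well as the defaults $\theta_0 = 0$ and $P_0 = 10^6$ are choices made for this example.

```python
def rls_forgetting_scalar(a_values, y_values, lam, theta0=0.0, P0=1e6):
    """Equations (5.59) and (5.60) for a one-parameter model y = theta * a."""
    theta, P = theta0, P0
    for a, y in zip(a_values, y_values):
        P = (P - (P * a) * (a * P) / (lam + a * P * a)) / lam   # (5.60)
        theta = theta + P * a * (y - a * theta)                 # (5.59)
    return theta


# Hypothetical noise-free data: the true parameter jumps from 1 to 3 halfway through.
a_values = [1.0] * 100
y_values = [1.0] * 50 + [3.0] * 50

theta_equal = rls_forgetting_scalar(a_values, y_values, lam=1.0)
theta_forget = rls_forgetting_scalar(a_values, y_values, lam=0.9)
print(theta_equal)   # near 2.0, the average over both regimes
print(theta_forget)  # near 3.0, the current parameter value
```

With $\lambda = 1$ all data pairs count equally, so the estimate settles between the two regimes; with $\lambda = 0.9$ the older half of the record is almost entirely discounted.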

The bootstrapping techniques introduced in the previous section for initializing
$\mathbf{P}_0$ and $\boldsymbol{\theta}_0$ also apply here for recursive LSE with forgetting factors.

5.7 STATISTICAL PROPERTIES AND THE MAXIMUM LIKELIHOOD ESTIMATOR*

To examine the statistical qualities of the LSE derived in the preceding sections, we
have to investigate the equation $\mathbf{y} = \mathbf{A}\boldsymbol{\theta} + \mathbf{e}$ in a statistical framework. In other
words, we assume $\mathbf{e}$ is a random vector and $\boldsymbol{\theta}$ is the true parameter vector; thus $\mathbf{y}$ is
also a random vector depending on $\mathbf{e}$. In particular, we shall introduce the Gauss-Markov
conditions and demonstrate that the least-squares estimator $\hat{\boldsymbol{\theta}}$ is unbiased,
consistent, and has minimum variance. Moreover, we shall explain the concept of
maximum likelihood estimation and establish an equivalence between the maximum 
likelihood estimator and the least-squares estimator under certain assumptions. 

First, let us begin by defining unbiased, consistent, and minimum variance esti- 
mators. 

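Definition 5.8 Unbiased estimator

An estimator $\hat{\boldsymbol{\theta}}$ of the parameter $\boldsymbol{\theta}$ is unbiased if $E[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$, where $E[\cdot]$ indicates
the statistical expectation.

□

Definition 5.9 Consistent estimator

An estimator $\hat{\boldsymbol{\theta}}_k$ is a consistent estimator of $\boldsymbol{\theta}$ if

$$\lim_{k\to\infty} P(\|\hat{\boldsymbol{\theta}}_k - \boldsymbol{\theta}\| > \epsilon) = 0 \quad \text{for any } \epsilon > 0.$$

Here $P(\cdot)$ is the probability function and $\hat{\boldsymbol{\theta}}_k$ is the estimator using $k$ input-output
data pairs.

□

Definition 5.10 Minimum variance estimator

An estimator $\hat{\boldsymbol{\theta}}$ is a minimum variance estimator of $\boldsymbol{\theta}$ if for any other estimator $\boldsymbol{\theta}^*$:

$$\mathrm{cov}(\hat{\boldsymbol{\theta}}) \le \mathrm{cov}(\boldsymbol{\theta}^*),$$

where $\mathrm{cov}(\boldsymbol{\theta})$ represents the covariance matrix of the random vector $\boldsymbol{\theta}$.

□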

The Gauss-Markov conditions state that the error vector $\mathbf{e}$ that accounts for
the measurement noise and/or modeling error is a random vector satisfying

1. $E[\mathbf{e}] = \mathbf{0}$

2. $E[\mathbf{e}\mathbf{e}^T] = \sigma^2\mathbf{I}$.

This is equivalent to the statement that the error vector $\mathbf{e}$ is a vector of $m$
uncorrelated random variables, each with zero mean and the same variance, $\sigma^2$. Under
these conditions, we have the Gauss-Markov theorem, as follows.

Theorem 5.2 Gauss-Markov theorem: LSE is unbiased and minimum variance

Under the Gauss-Markov conditions, the LSE $\hat{\boldsymbol{\theta}}$ is unbiased and has minimum
variance when compared with all other unbiased estimators that are linear
combinations of the observations $y_i$.

Proof: Taking the expectation of $\mathbf{y} = \mathbf{A}\boldsymbol{\theta} + \mathbf{e}$ leads to $E[\mathbf{y}] = E[\mathbf{A}\boldsymbol{\theta}] + E[\mathbf{e}] = \mathbf{A}\boldsymbol{\theta}$.
Hence

$$\begin{aligned} E[\hat{\boldsymbol{\theta}}] &= E[(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{y}] \\ &= (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T E[\mathbf{y}] \\ &= (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{A}\boldsymbol{\theta} \\ &= \boldsymbol{\theta}, \end{aligned}$$

which shows that $\hat{\boldsymbol{\theta}}$ is unbiased. (Note that the proof of unbiasedness does not
require the second assumption of the Gauss-Markov conditions.) The proof of
minimum variance is a bit lengthy; it can be found, for example, in Section 2.9 of [8].

□ 

Due to the Gauss-Markov theorem, the LSE $\hat{\boldsymbol{\theta}}$ is often referred to as the best
linear unbiased estimator (BLUE for short), where best implies minimum
variance. The following theorem establishes the consistency of the LSE $\hat{\boldsymbol{\theta}}$.

Theorem 5.3 Gauss-Markov theorem: LSE is consistent

Under the Gauss-Markov conditions, the LSE $\hat{\boldsymbol{\theta}}$ is a consistent estimator of $\boldsymbol{\theta}$ if
$(\mathbf{A}^T\mathbf{A})^{-1} \to \mathbf{0}$ as $m \to \infty$, where $m$ is the row dimension of $\mathbf{A}$.

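Proof: Note that

$$\begin{aligned} \hat{\boldsymbol{\theta}} &= (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{y} \\ &= (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T(\mathbf{A}\boldsymbol{\theta} + \mathbf{e}) \\ &= \boldsymbol{\theta} + (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{e}. \end{aligned}$$

Thus

$$\begin{aligned} \mathrm{cov}(\hat{\boldsymbol{\theta}}) &= E[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T] \\ &= E[(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{e}\mathbf{e}^T\mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}] \\ &= (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T E[\mathbf{e}\mathbf{e}^T]\mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1} \\ &= (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T(\sigma^2\mathbf{I})\mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1} \\ &= \sigma^2(\mathbf{A}^T\mathbf{A})^{-1}. \end{aligned}$$

This implies $\mathrm{cov}(\hat{\boldsymbol{\theta}}) \to \mathbf{0}$ as $m \to \infty$ if $(\mathbf{A}^T\mathbf{A})^{-1} \to \mathbf{0}$, which completes the proof.

□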


The rest of this section will be devoted to exploring the relationship between
the maximum likelihood estimator and the least-squares estimator under certain
assumptions. Maximum likelihood estimation is one of the most widely used
techniques for estimating the parameters of a statistical distribution. To introduce this
method, let us assume that we have a random variable $x$ whose probability density
function (or probability function, if $x$ assumes discrete values) is $f(x; \theta)$, where $\theta$
is the parameter to be estimated. For a sample of $n$ observations of this variable,
$x_1, \ldots, x_n$, the likelihood function, $L$, is defined by

$$L = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta). \tag{5.61}$$

Without any prior information about the true value of $\theta$, we would naturally pick
a value of $\theta$ that provides a high probability of obtaining the actual observed data
$x_1, \ldots, x_n$. Thus the maximum likelihood estimator (abbreviated as MLE),
$\hat{\theta}$, is defined as the value of $\theta$ which maximizes $L$:

$$\left.\frac{\partial L}{\partial \theta}\right|_{\theta = \hat{\theta}} = 0. \tag{5.62}$$

Or, equivalently,

$$\left.\frac{\partial \ln L}{\partial \theta}\right|_{\theta = \hat{\theta}} = 0, \tag{5.63}$$

since $\ln$ is a monotonic function and the value of $\theta$ which maximizes $L$ will be
the same as the value of $\theta$ which maximizes $\ln L$. The following two examples
illustrate the method of estimating the parameters for an exponential and a normal
distribution.


Example 5.3 MLE for exponential distribution

Suppose a random variable $x$ has an exponential distribution

$$f(x; \theta) = \theta^{-1} e^{-x/\theta}.$$

Then the likelihood function for $m$ observations $x_1, \ldots, x_m$ takes the form

$$\begin{aligned} L &= (\theta^{-1}e^{-x_1/\theta})(\theta^{-1}e^{-x_2/\theta}) \cdots (\theta^{-1}e^{-x_m/\theta}) \\ &= \theta^{-m}\exp\left(-\theta^{-1}\sum x_i\right), \end{aligned}$$

and

$$\ln L = -m\ln\theta - \frac{1}{\theta}\sum x_i.$$

Differentiating the preceding equation with respect to $\theta$ gives

$$\frac{\partial \ln L}{\partial \theta} = -\frac{m}{\theta} + \frac{1}{\theta^2}\sum x_i = 0.$$

Therefore,

$$\hat{\theta} = \frac{\sum x_i}{m}.$$

□


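Example 5.4 MLE for normal distributions

Suppose a random variable $x$ has a normal distribution

$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right],$$

where $\mu$ (mean) and $\sigma^2$ (variance) are undetermined parameters. For $m$ observations
$x_1, \ldots, x_m$, we have

$$L = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^m \exp\left[-\frac{1}{2\sigma^2}\sum (x_i - \mu)^2\right],$$

and

$$\ln L = -m\ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2.$$

Differentiating with respect to $\mu$ and $\sigma$ yields

$$\frac{\partial \ln L}{\partial \mu} = 0 \;\Rightarrow\; \frac{1}{\sigma^2}\sum (x_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{\sum x_i}{m},$$

$$\frac{\partial \ln L}{\partial \sigma} = 0 \;\Rightarrow\; -\frac{m}{\sigma} + \frac{1}{\sigma^3}\sum (x_i - \mu)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{\sum (x_i - \hat{\mu})^2}{m}.$$

□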
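
The closed-form estimates of Examples 5.3 and 5.4 can be checked numerically. The sketch below uses one small hypothetical sample for both distributions (an assumption made purely for illustration) and verifies that nudging the parameters away from the estimates never increases $\ln L$. The function names and the perturbation size are likewise choices of this example.

```python
import math

# Hypothetical sample, used for both distributions purely as an illustration.
xs = [0.5, 1.2, 2.0, 0.3, 1.0]
m = len(xs)


def loglik_exponential(theta):
    """ln L for f(x; theta) = (1/theta) exp(-x/theta), as in Example 5.3."""
    return -m * math.log(theta) - sum(xs) / theta


def loglik_normal(mu, sigma):
    """ln L for the normal density of Example 5.4."""
    return (-m * math.log(math.sqrt(2.0 * math.pi) * sigma)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma ** 2))


theta_hat = sum(xs) / m                                   # sample mean
mu_hat = sum(xs) / m                                      # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / m       # divisor m, not m - 1
sigma_hat = math.sqrt(sigma2_hat)

# Nudging any parameter away from its estimate should not raise ln L.
step = 0.05
exp_ok = all(loglik_exponential(theta_hat) >= loglik_exponential(theta_hat + d)
             for d in (-step, step))
norm_ok = all(loglik_normal(mu_hat, sigma_hat) >= loglik_normal(mu_hat + dm, sigma_hat + ds)
              for dm in (-step, 0.0, step) for ds in (-step, 0.0, step))
print(theta_hat, mu_hat, sigma2_hat, exp_ok, norm_ok)
```

Note that the MLE of the variance divides by $m$, so it differs from the usual unbiased sample variance, which divides by $m - 1$.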


Now we consider the relationship between the maximum likelihood estimator
and the least-squares estimator, which is governed by the following theorem:

Theorem 5.4 Equivalence between the LSE and the MLE

In addition to the Gauss-Markov conditions, if we assume that each element of the
error vector $\mathbf{e}$ is a random variable with normal distribution, then the LSE of $\boldsymbol{\theta}$ is
precisely equal to the MLE of $\boldsymbol{\theta}$.

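Proof: Since each $e_i$ $[= y_i - \sum_j f_j(\mathbf{u}_i)\theta_j]$ of the error vector $\mathbf{e}$ is uncorrelated with
every other $e_j$ and normally distributed with zero mean and the same variance, $\sigma^2$,
we have the following likelihood function:

$$\begin{aligned} L &= \frac{1}{(\sqrt{2\pi}\,\sigma)^m}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{m}\left[y_i - \sum_{j=1}^{n} f_j(\mathbf{u}_i)\theta_j\right]^2\right\} \\ &= \frac{1}{(\sqrt{2\pi}\,\sigma)^m}\exp\left[-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{A}\boldsymbol{\theta})\right]. \end{aligned}$$

Taking the natural logarithm of the preceding likelihood function gives us

$$\ln L = -\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{A}\boldsymbol{\theta}) - m\ln(\sqrt{2\pi}\,\sigma).$$

Maximizing $\ln L$ with respect to $\boldsymbol{\theta} = [\theta_1 \cdots \theta_n]^T$ yields

$$\left.\frac{\partial}{\partial \boldsymbol{\theta}}(\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{A}\boldsymbol{\theta})\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = \mathbf{0}.$$

The solution of $\hat{\boldsymbol{\theta}}$ from the preceding equation is identical to the least-squares
estimator given by Equation (5.19). Thus the MLEs of the regression coefficients
are precisely the same as the least-squares estimators of these coefficients. This
establishes the well-known principle that the LSE is equivalent to the MLE under
uncorrelated Gaussian noise with zero mean and the same variance.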


□ 


5.8 LSE FOR NONLINEAR MODELS 

Although the least-squares methods for linear models are the most widely used tech- 
niques for fitting a set of observation data, occasionally it is appropriate to assume 
that the data are related through a model with nonlinear parameters. Nonlinear 
models (that is, nonlinear in the parameters to be estimated) can be divided into 
two types, which we will refer to as intrinsically linear and intrinsically nonlin- 
ear models, respectively. Through appropriate transformations of its input-output 
variables and fitting parameters, an intrinsically linear model can be expressed in 
the standard form of a linear model represented by Equation (5.12). Thus we can 
apply standard least-squares methods to approximate the optimal parameters ef- 
fectively. (Due to the use of transformations, the solution is not exactly optimal in 
minimizing the squared error measure.) 

If a nonlinear model cannot be expressed in a linear form after transformation, 
then it is intrinsically nonlinear. For nonlinear models of this type, we can apply 
nonlinear least-squares methods, described in Section 6.8 of Chapter 6. 

This section gives several examples of transformation methods that can be ap- 
plied to nonlinear models that are intrinsically linear. One such nonlinear model 
describes how the radioactivity y of certain chemicals decays with respect to time 
t: 

$$y = a e^{bt}, \tag{5.64}$$

where $a$ and $b$ ($< 0$) are model parameters to be determined. According to the
given training data set $\{(t_i; y_i),\ i = 1, \ldots, m\}$, the squared error measure takes the
form

$$E(a, b) = \sum_{i=1}^{m}(y_i - a e^{b t_i})^2. \tag{5.65}$$

Attempting to minimize this error measure yields

$$\begin{cases} \dfrac{\partial E}{\partial a} = 2\displaystyle\sum_{i=1}^{m}(y_i - a e^{b t_i})(-e^{b t_i}) = 0, \\[2ex] \dfrac{\partial E}{\partial b} = 2\displaystyle\sum_{i=1}^{m}(y_i - a e^{b t_i})(-a t_i e^{b t_i}) = 0. \end{cases}$$

Unfortunately, these are simultaneous nonlinear equations and it is hard, if not
impossible, to find an analytic closed-form solution.

If we take the natural logarithm of Equation (5.64), we obtain a linear model
that relates $t$ and $\ln y$ through linear parameters $\ln a$ and $b$:

$$\ln y = \ln a + bt,$$

which indicates that the original nonlinear model is intrinsically linear.
Consequently, through appropriate transformations, a nonlinear model that is intrinsically
linear can be converted into a linear one and thus the LSE techniques developed
earlier can be applied. Table 5.2 lists some nonlinear models that are intrinsically
linear.

Table 5.2. Nonlinear models that are intrinsically linear.

Nonlinear model            Transformation       Linear form
$y = a e^{bx}$             Natural logarithm    $\ln y = \ln a + bx$
$y = a x^b$                Natural logarithm    $\ln y = \ln a + b \ln x$
$y = \dfrac{ax}{b + x}$    Reciprocal           $\dfrac{1}{y} = \dfrac{1}{a} + \dfrac{b}{a}\,\dfrac{1}{x}$
$y = \dfrac{a}{b + x}$     Reciprocal           $\dfrac{1}{y} = \dfrac{b}{a} + \dfrac{1}{a}\,x$
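
As a concrete instance of the second row of Table 5.2, the sketch below fits the power-law model $y = a x^b$ by a straight-line least-squares fit of $\ln y$ against $\ln x$. The data are hypothetical and noise-free (generated with $a = 2$ and $b = 1.5$), and the helper `fit_line` is simply the two-parameter LSE written out in closed form; both are assumptions of this example.

```python
import math


def fit_line(us, vs):
    """Least-squares intercept and slope for v = c0 + c1 * u (two-parameter LSE)."""
    m = len(us)
    su, sv = sum(us), sum(vs)
    suu = sum(u * u for u in us)
    suv = sum(u * v for u, v in zip(us, vs))
    c1 = (m * suv - su * sv) / (m * suu - su * su)
    c0 = (sv - c1 * su) / m
    return c0, c1


# Hypothetical noise-free data from y = a * x**b with a = 2 and b = 1.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x ** 1.5 for x in xs]

# Transform: ln y = ln a + b ln x, which is linear in (ln a, b).
ln_a, b_hat = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
a_hat = math.exp(ln_a)
print(a_hat, b_hat)  # close to 2.0 and 1.5
```

Because the data here contain no noise, the transformed fit recovers the generating parameters; with noisy data the two error measures differ.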

On the other hand, for an intrinsically nonlinear model, such as

$$y = a_0 + a_1 x^{b_1} + a_2 x^{b_2},$$

there exist no transformation techniques that can put all the fitting parameters
$\boldsymbol{\theta} = [a_0, a_1, a_2, b_1, b_2]^T$ or their transformed quantities into a linear form. In fact, it
is obvious that among the five fitting parameters, $[a_0, a_1, a_2]$ are linear parameters
while $[b_1, b_2]$ are nonlinear ones. Thus we can apply LSE for the linear parameters
and the iterative optimization methods (in Chapter 6) for the nonlinear ones. This
leads to so-called hybrid learning, which is detailed in Section 8.5.

Sometimes it is necessary to take iterated transformations to reduce a
complicated nonlinear model to a linear one. The following example illustrates such a
case.

Example 5.5 Iterated transformations

Suppose that we have a nonlinear model given by

$$y = \frac{1}{1 + a x_1^{b} e^{c x_2}}, \tag{5.66}$$

where $x_1$ and $x_2$ are inputs and $a$, $b$, and $c$ are parameters. Taking reciprocals,
subtracting 1, and taking the natural logarithm of both sides, we can convert the
original model into a linear one:

$$\ln(y^{-1} - 1) = \ln a + b\ln x_1 + c x_2, \tag{5.67}$$

which states that the transformed output $\ln(y^{-1} - 1)$ is explicitly expressed as a
linear function of the (transformed) parameters $\ln a$, $b$, and $c$. Other examples of
iterated transformations are explored in the exercises at the end of this chapter.

□

Table 5.3. Training data for Example 5.6.

$t_i$   0      0.80   1.84   2.90   4.06   4.81   6.07   7.06   8.15   8.87   9.98
$y_i$   0.98   0.69   0.47   0.46   0.29   0.16   0.23   0.10   0.03   0.12   0.01


It should be kept in mind, however, that by applying this transformation method
to Equation (5.64), we are not searching for the parameters that minimize the error
measure $E$ in Equation (5.65). Instead, the parameters we obtain minimize a new
error measure defined by

$$E'(a, b) = \sum_{i=1}^{m}(\ln y_i - \ln a - b t_i)^2. \tag{5.68}$$


From empirical studies, it is known that the parameters $a = a'$ and $b = b'$, which
minimize $E'(a, b)$, will not be too different from the optimal parameters $a = \hat{a}$ and
$b = \hat{b}$ that minimize $E(a, b)$ in Equation (5.65), as long as the transformation is
monotonic and the observation data consistently conform to the underlying model.
Thus most of the time we can take $a'$ and $b'$ as good approximations of the optimal
parameters $\hat{a}$ and $\hat{b}$, respectively. In cases where $a'$ and $b'$ cannot be used to describe
the observed data satisfactorily, we can use $(a', b')$ as an initial guess and apply other
iterative optimization methods discussed in Chapter 7 to improve the fitting. The
following example shows the difference between minimizing $E$ in Equation (5.65)
and $E'$ in Equation (5.68).

Figure 5.4. Transformation method (dashed curve) versus iterative optimization
(solid curve). (MATLAB file: transform.m)

Example 5.6 Transformation method versus nonlinear optimization

Suppose that the underlying model describing the data set in Table 5.3 is known to
be

$$y = a e^{bt}, \tag{5.69}$$

which is an intrinsically linear model. Figure 5.4 illustrates the training data set
(marked with *), the solid curve that minimizes $E(a, b)$, and the dashed curve
that minimizes $E'(a, b)$. The solid curve is obtained through a simple iterative
gradient method to be introduced in Chapter 6; the dashed curve is derived by the
transformation plus the LSE method. In this example, these two curves do not
differ too much. However, if the data set does not conform to the underlying model
consistently, these two curves may be quite different in shape. Two such examples
can be found in Exercise 13 of Section 8.1 and Exercise 7 of Section 10.3 in [1].


□ 
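
The comparison in Example 5.6 can be approximated with a short script. The sketch below is not the MATLAB file mentioned in the caption of Figure 5.4; it is an independent illustration that uses the data of Table 5.3, obtains $(a', b')$ from a line fit of $\ln y$ against $t$, and then reduces $E(a, b)$ further with plain gradient steps and step halving. The refinement scheme, the iteration count, and the initial step size 0.1 are assumptions of this example.

```python
import math

# Training data from Table 5.3.
ts = [0.0, 0.80, 1.84, 2.90, 4.06, 4.81, 6.07, 7.06, 8.15, 8.87, 9.98]
ys = [0.98, 0.69, 0.47, 0.46, 0.29, 0.16, 0.23, 0.10, 0.03, 0.12, 0.01]
m = len(ts)


def E(a, b):
    """Squared error of Equation (5.65)."""
    return sum((y - a * math.exp(b * t)) ** 2 for t, y in zip(ts, ys))


def E_log(a, b):
    """Squared error of Equation (5.68), measured on ln y."""
    return sum((math.log(y) - math.log(a) - b * t) ** 2 for t, y in zip(ts, ys))


# Transformation method: ordinary line fit of ln y against t.
vs = [math.log(y) for y in ys]
st, sv = sum(ts), sum(vs)
stt = sum(t * t for t in ts)
stv = sum(t * v for t, v in zip(ts, vs))
b_lin = (m * stv - st * sv) / (m * stt - st * st)
a_lin = math.exp((sv - b_lin * st) / m)

# Refinement: gradient steps on E with step halving (an assumed, simple scheme).
a, b = a_lin, b_lin
for _ in range(500):
    ga = sum(-2.0 * (y - a * math.exp(b * t)) * math.exp(b * t) for t, y in zip(ts, ys))
    gb = sum(-2.0 * (y - a * math.exp(b * t)) * a * t * math.exp(b * t) for t, y in zip(ts, ys))
    step = 0.1
    while step > 1e-12 and E(a - step * ga, b - step * gb) >= E(a, b):
        step *= 0.5
    if step <= 1e-12:
        break  # no further decrease found
    a, b = a - step * ga, b - step * gb

print(a_lin, b_lin, E(a_lin, b_lin), E_log(a_lin, b_lin))
print(a, b, E(a, b), E_log(a, b))
```

The transformed fit is best with respect to $E'$, while the refined parameters are at least as good with respect to $E$; each pair minimizes a different error measure.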


5.9 SUMMARY 

This chapter presents standard least-squares methods for linear models in system
identification. Although our scope is confined to linear models, the underlying
concepts can be extended to nonlinear models, as explained in Section 6.8.

Least-squares techniques play a pivotal role in the literature of adaptive control,
adaptive signal processing, regression, and statistics. In the subsequent chapters,
we shall see how these techniques can be applied effectively to adaptive networks
(Chapter 8), supervised learning neural networks (Chapter 9), and adaptive
neuro-fuzzy inference systems (Chapters 12 and 13).

EXERCISES


1. Verify Lemma 5.5 on page 102 by direct differentiation. 

2. Completing the square is another method of finding the optimum of a
quadratic function. Show that $f(\mathbf{x})$ in Equation (5.1) can be reorganized as

$$f(\mathbf{x}) = (\mathbf{x} + \mathbf{A}^{-1}\mathbf{b})^T\mathbf{A}(\mathbf{x} + \mathbf{A}^{-1}\mathbf{b}) + c - \mathbf{b}^T\mathbf{A}^{-1}\mathbf{b}. \tag{5.70}$$

Note that the first term on the right-hand side of the preceding equation is
non-negative if $\mathbf{A}$ is positive definite, and non-positive if $\mathbf{A}$ is negative definite.
Therefore, $f(\mathbf{x})$ achieves its optimum $c - \mathbf{b}^T\mathbf{A}^{-1}\mathbf{b}$ at $\mathbf{x} = -\mathbf{A}^{-1}\mathbf{b}$.


3. Explain how the derivation of the previous exercise must be modified if A is 
positive definite but not symmetric. 

4. Solve the least-squares problem by completing the square. Specifically, arrange 
E(0) in Equation (5.17) into the format of Equation (5.70) and thus verify the 
LSE formula in Equation (5.19) and the minimum error in Equation (5.24). 

5. Derive the weighted LSE in Equation (5.26) directly by setting the derivative 
of the weighted error measure in Equation (5.25) to zero. 


6. Find the least-squares polynomials in Figures 5.2(b) through 5.2(d). 


7. Prove Equations (5.34) through (5.39) to confirm their intuitive geometrical 
interpretations. 

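8. Show that the recursive LSE formulas in Equation (5.50) are equivalent to

$$\begin{cases} \boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \dfrac{\mathbf{P}_k\mathbf{a}_{k+1}}{1 + \mathbf{a}_{k+1}^T\mathbf{P}_k\mathbf{a}_{k+1}}(y_{k+1} - \mathbf{a}_{k+1}^T\boldsymbol{\theta}_k), \\[2ex] \mathbf{P}_{k+1} = \mathbf{P}_k - \dfrac{\mathbf{P}_k\mathbf{a}_{k+1}\mathbf{a}_{k+1}^T\mathbf{P}_k}{1 + \mathbf{a}_{k+1}^T\mathbf{P}_k\mathbf{a}_{k+1}}. \end{cases} \tag{5.71}$$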


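9. The recursive LSE in Equation (5.47) can be derived in another way. From
Exercise 4, it is clear that the error measure after $k$ data pairs have been
observed can be expressed as

$$E_k(\boldsymbol{\theta}) = E_k(\boldsymbol{\theta}_k) + (\boldsymbol{\theta} - \boldsymbol{\theta}_k)^T\mathbf{A}^T\mathbf{A}(\boldsymbol{\theta} - \boldsymbol{\theta}_k).$$

Based on $E_k(\boldsymbol{\theta})$, the error measure after observing the $(k+1)$th data pair $(\mathbf{a}; y)$ can
be formulated as

$$\begin{aligned} E_{k+1}(\boldsymbol{\theta}) &= E_k(\boldsymbol{\theta}) + (y - \mathbf{a}^T\boldsymbol{\theta})^T(y - \mathbf{a}^T\boldsymbol{\theta}) \\ &= E_k(\boldsymbol{\theta}_k) + (\boldsymbol{\theta} - \boldsymbol{\theta}_k)^T\mathbf{A}^T\mathbf{A}(\boldsymbol{\theta} - \boldsymbol{\theta}_k) + (y - \mathbf{a}^T\boldsymbol{\theta})^T(y - \mathbf{a}^T\boldsymbol{\theta}). \end{aligned}$$

Show that Equation (5.47) can be obtained alternatively by

$$\left.\frac{\partial E_{k+1}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}_{k+1}} = \mathbf{0}.$$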


10. Derive Equations (5.51) and (5.52), which show the dependency of $\boldsymbol{\theta}_k$ and $\mathbf{P}_k$
on their initial values.

11. Show that $\mathbf{P}_k$ in Equation (5.51) is positive definite.

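12. Let $(\mathbf{a}; y)$ be the $(k+1)$th data pair and define a priori and a posteriori
prediction errors as

$$e_{\text{prior}} = y - \mathbf{a}^T\boldsymbol{\theta}_k,$$

$$e_{\text{post}} = y - \mathbf{a}^T\boldsymbol{\theta}_{k+1}.$$

Show that

$$\|e_{\text{post}}\| = \frac{\|e_{\text{prior}}\|}{1 + \mathbf{a}^T\mathbf{P}_k\mathbf{a}} \le \|e_{\text{prior}}\|.$$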

13. Derive the recursive LSE for multiple-output systems, as shown in Equa- 
tion (5.53). 

14. Derive in detail the formulas for the recursive LSE with forgetting factor $\lambda$ in
Equations (5.59) and (5.60).

15. Show that the following two nonlinear models are intrinsically linear: (a) y = 

3™; (b) y = In a + x — In (b + e x ). (In both cases, a and b are fitting 

1 -I- expf^j 

parameters.)

Chapter 6


Derivative-based Optimization 


E. Mizutani and J.-S. R. Jang 


6.1 INTRODUCTION 

This chapter reviews a fundamental class of gradient-based optimization tech- 
niques, capable of determining search directions according to an objective func- 
tion’s derivative information. We discuss the preliminary concepts that sustain 
the descent algorithms used for solving minimization problems, as well as their rel- 
evant procedures. We begin with the steepest descent method and Newton’s 
method, which form the foundation of many gradient-based algorithms. Actually, 
many instrumental algorithms can be regarded as a form of compromise between 
steepest descent and Newton’s methods. We also describe their relevant techniques 
(e.g., conjugate gradient methods for practical advances in large problems, and 
the Gauss-Newton method and its Levenberg-Marquardt variant as a non- 
linear extension of the least-squares methods described in Chapter 5). 

A class of the gradient-based methods can be applied to optimizing nonlinear
neuro-fuzzy models, thereby allowing such models to play a prominent role in the
framework of soft computing. In fact, steepest descent and conjugate gradient
methods are major algorithms used for neural network learning in conjunction with the
back-error propagation process. Least-squares estimation is another widely employed
algorithm because the sum of squared errors is chosen as the objective function to be
minimized in many cases. Hence, we discuss nonlinear least-squares problems with
particular emphasis placed on Gauss-Newton methods with Levenberg-Marquardt
notions. Those methods are commonly used in data fitting and regression
involving nonlinear models. Therefore, the gradient-based methods are closely related to
neuro-fuzzy and soft computing techniques covered in the subsequent chapters.

6.2 DESCENT METHODS 

In this chapter, we focus on minimizing a real-valued objective function $E$ defined
on an $n$-dimensional input space $\boldsymbol{\theta} = [\theta_1, \theta_2, \ldots, \theta_n]^T$. Finding a (possibly local)
minimum point $\boldsymbol{\theta} = \boldsymbol{\theta}^*$ that minimizes $E(\boldsymbol{\theta})$ is of primary concern.

In general, a given objective function $E$ may have a nonlinear form with respect
to an adjustable parameter $\boldsymbol{\theta}$. Due to the complexity of $E$, we often resort to
an iterative algorithm to explore the input space efficiently. In iterative descent
methods, the next point $\boldsymbol{\theta}_{\text{next}}$ is determined by a step down from the current
point $\boldsymbol{\theta}_{\text{now}}$ in a direction vector $\mathbf{d}$:

$$\boldsymbol{\theta}_{\text{next}} = \boldsymbol{\theta}_{\text{now}} + \eta\mathbf{d}, \tag{6.1}$$

where $\eta$ is some positive step size regulating to what extent to proceed in that
direction. In neuro-fuzzy literature, the term learning rate is used for the step
size $\eta$. For our convenience, we alternatively use the following formula:

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta_k\mathbf{d}_k \quad (k = 1, 2, 3, \ldots), \tag{6.2}$$

where $k$ denotes the current iteration number, and $\boldsymbol{\theta}_{\text{now}}$ and $\boldsymbol{\theta}_{\text{next}}$ represent two
consecutive elements in a generated sequence of solution candidates $\{\boldsymbol{\theta}_k\}$. The $\boldsymbol{\theta}_k$
is intended to converge to a (local) minimum $\boldsymbol{\theta}^*$.

The iterative descent methods compute the $k$th step $\eta_k\mathbf{d}_k$ through two
procedures: first determining direction $\mathbf{d}$, and then calculating step size $\eta$. The next
point $\boldsymbol{\theta}_{\text{next}}$ should satisfy the following inequality:

$$E(\boldsymbol{\theta}_{\text{next}}) = E(\boldsymbol{\theta}_{\text{now}} + \eta\mathbf{d}) < E(\boldsymbol{\theta}_{\text{now}}). \tag{6.3}$$

The principal differences between various descent algorithms lie in the first
procedure for determining successive directions. Once the decision is reached, all
algorithms call for movement to a (local) minimum point on the line determined by
the current point $\boldsymbol{\theta}_{\text{now}}$ and the direction $\mathbf{d}$. That is, for the second procedure, the
optimum step size can be determined by line minimization:

$$\eta^* = \arg\min_{\eta > 0} \phi(\eta), \tag{6.4}$$

where

$$\phi(\eta) = E(\boldsymbol{\theta}_{\text{now}} + \eta\mathbf{d}). \tag{6.5}$$

The search of $\eta^*$ is accomplished by line search (or one-dimensional search)
methods, as described in Section 6.5.
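
To make Equations (6.4) and (6.5) concrete, the following sketch performs one line minimization with a golden-section search. The choice of golden-section search, the quadratic objective, the search interval $[0, 1]$, and all names in the code are assumptions made for this illustration; they are not taken from Section 6.5.

```python
def E(theta):
    """Hypothetical objective used only for illustration."""
    return theta[0] ** 2 + 10.0 * theta[1] ** 2


def golden_section(phi, lo, hi, tol=1e-8):
    """Shrink [lo, hi] around the minimizer of a unimodal function phi."""
    ratio = (5 ** 0.5 - 1) / 2
    while hi - lo > tol:
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if phi(left) < phi(right):
            hi = right
        else:
            lo = left
    return 0.5 * (lo + hi)


theta_now = [1.0, 1.0]
d = [-2.0, -20.0]   # a descent direction (here the negative gradient of E at theta_now)


def phi(eta):
    """phi(eta) = E(theta_now + eta * d), Equation (6.5)."""
    return E([theta_now[i] + eta * d[i] for i in range(2)])


eta_star = golden_section(phi, 0.0, 1.0)   # search interval [0, 1] is an assumption
theta_next = [theta_now[i] + eta_star * d[i] for i in range(2)]
print(eta_star, E(theta_now), E(theta_next))
```

For this quadratic, $\phi'(\eta) = 0$ can also be solved by hand, giving $\eta^* = 404/8008$, which the search should reproduce to within its tolerance.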


6.2.1 Gradient-based Methods 

When the straight downhill direction d is determined on the basis of the gradi- 
ent (g) of an objective function E, such descent methods are called gradient-based 
descent methods. 

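The gradient of a differentiable function $E: R^n \to R$ at $\boldsymbol{\theta}$ is the vector of first
derivatives of $E$, denoted as $\mathbf{g}$. That is,

$$\mathbf{g}(\boldsymbol{\theta}) \;(= \nabla E(\boldsymbol{\theta})) \;\stackrel{\text{def}}{=}\; \left[\frac{\partial E(\boldsymbol{\theta})}{\partial \theta_1}, \frac{\partial E(\boldsymbol{\theta})}{\partial \theta_2}, \ldots, \frac{\partial E(\boldsymbol{\theta})}{\partial \theta_n}\right]^T. \tag{6.6}$$

For simplicity, we frequently use $\mathbf{g}$ by suppressing the argument $\boldsymbol{\theta}$ in $\mathbf{g}(\boldsymbol{\theta})$.

Figure 6.1. Feasible descent directions. Directions from the starting point $\boldsymbol{\theta}_{\text{now}}$
in the shaded area are possible descent vector candidates. When $\mathbf{d} = -\mathbf{g}$, $\mathbf{d}$ is the
steepest descent direction at a local point $\boldsymbol{\theta}_{\text{now}}$.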

In general, based on a given gradient, downhill directions adhere to the following
condition for feasible descent directions¹:

$$\phi'(0) = \left.\frac{dE(\boldsymbol{\theta}_{\text{now}} + \eta\mathbf{d})}{d\eta}\right|_{\eta = 0} = \mathbf{g}^T\mathbf{d} = \|\mathbf{g}\|\,\|\mathbf{d}\|\cos(\xi(\boldsymbol{\theta}_{\text{now}})) < 0, \tag{6.7}$$

where $\xi$ signifies the angle between $\mathbf{g}$ and $\mathbf{d}$, and $\xi(\boldsymbol{\theta}_{\text{now}})$ denotes the angle between
$\mathbf{g}_{\text{now}}$ and $\mathbf{d}$ at the current point $\boldsymbol{\theta}_{\text{now}}$, as illustrated in Figures 6.1 and 6.2. This
can be verified by the Taylor series expansion of $E$:

$$E(\boldsymbol{\theta}_{\text{now}} + \eta\mathbf{d}) = E(\boldsymbol{\theta}_{\text{now}}) + \eta\mathbf{g}^T\mathbf{d} + O(\eta^2). \tag{6.8}$$

The second term on the right-hand side will dominate the third and other
higher-order terms of $\eta$ when $\eta \to 0$. With such a small positive $\eta$, the inequality (6.3)
clearly holds when $\mathbf{g}^T\mathbf{d} < 0$. The shaded area in Figure 6.1 denotes all feasible
descent directions that satisfy the condition (6.7). Notably, the gradient directions
are always perpendicular to the contour curves (see Exercise 2).

¹ This descent direction condition (6.7) does not guarantee convergence of the algorithms.

Figure 6.2. Angle $\xi$ between gradient directions $\mathbf{g}$ and a descent direction $\mathbf{d}$, which
is determined by a certain algorithm at the current point $\boldsymbol{\theta}_{\text{now}}$. Let $N$ be the set
of all possible next points; $N \supset \{A, B, C, D, E, F, X, Y\}$. In the one-way downhill
direction $\mathbf{d}$, the next point $\boldsymbol{\theta}_{\text{next}}$ may be one of six points ($A$, $B$, $C$, $D$, $E$, or $F$) or
be in the vicinity of them, depending on step sizes. By comparison, in the steepest
descent direction, $\boldsymbol{\theta}_{\text{next}}$ may be either $X$ or $Y$, or close to them.

A class of gradient-based descent methods has the following fundamental form,
in which feasible descent directions can be determined by deflecting the gradients
through multiplication by $\mathbf{G}$ (i.e., deflected gradients):

$$\boldsymbol{\theta}_{\text{next}} = \boldsymbol{\theta}_{\text{now}} - \eta\mathbf{G}\mathbf{g}, \tag{6.9}$$

with some positive step size $\eta$ and some positive definite matrix $\mathbf{G}$. Clearly, when
$\mathbf{d} = -\mathbf{G}\mathbf{g}$, the descent direction condition (6.7) holds since $\mathbf{g}^T\mathbf{d} = -\mathbf{g}^T\mathbf{G}\mathbf{g} < 0$.
Many other variants of gradient-based methods (e.g., Newton's method and the
Levenberg-Marquardt method) possess the aforementioned form to bias the negative
gradient direction ($-\mathbf{g}$) for a better choice. Those variants are discussed in the
subsequent sections.

Ideally, we wish to find a value of $\boldsymbol{\theta}_{\text{next}}$ that satisfies the following²:

$$\mathbf{g}(\boldsymbol{\theta}_{\text{next}}) = \left.\frac{\partial E(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}_{\text{next}}} = \mathbf{0}. \tag{6.10}$$


In practice, however, it is difficult to solve Equation (6.10) analytically. To 
minimize the objective function, the descent procedures are typically repeated until 
one of the following stopping criteria is satisfied: 

1. The objective function value is sufficiently small; 

2. The length of the gradient vector g is smaller than a specified value; or 

3. The specified computing time is exceeded. 
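
The three stopping criteria above can be sketched as a generic descent loop. This is only an illustration under stated assumptions: the function name `descend`, the thresholds, the fixed step size, and the test objective are hypothetical and do not come from the text. Criterion 1 is meaningful only when the minimum value of $E$ is known to be near zero (as with a squared-error measure), so it never fires for the objective used here.

```python
import time

def descend(E, grad, theta, eta=0.1, e_tol=1e-10, g_tol=1e-6, max_seconds=1.0):
    """Repeat theta <- theta - eta * g until one stopping criterion fires."""
    start = time.monotonic()
    while True:
        g = grad(theta)
        if E(theta) < e_tol:                          # criterion 1: small objective
            return theta, "objective small"
        if sum(gi * gi for gi in g) ** 0.5 < g_tol:   # criterion 2: short gradient
            return theta, "gradient small"
        if time.monotonic() - start > max_seconds:    # criterion 3: time budget
            return theta, "time exceeded"
        theta = [t - eta * gi for t, gi in zip(theta, g)]

# Hypothetical objective E(x, y) = (x - 1)^2 + (y + 2)^2 + 3, minimized at (1, -2).
E = lambda th: (th[0] - 1) ** 2 + (th[1] + 2) ** 2 + 3
grad = lambda th: [2 * (th[0] - 1), 2 * (th[1] + 2)]
theta, reason = descend(E, grad, [0.0, 0.0])
```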


6.3 THE METHOD OF STEEPEST DESCENT 

The method of steepest descent, also known as the gradient method, is one of the 
oldest techniques for minimizing a given function defined on a multidimensional 
input space. This method forms the basis for many direct methods used in optimizing 
both constrained and unconstrained problems. Moreover, despite its slow convergence, 
the method is the most frequently used nonlinear optimization technique due 
to its simplicity. 

When $\mathbf{G} = \mathbf{I}$, the identity matrix, with some positive step size $\eta$, 
Equation (6.9) becomes the well-known steepest descent formula: 

$$\boldsymbol{\theta}_{\text{next}} = \boldsymbol{\theta}_{\text{now}} - \eta\,\mathbf{g}. \tag{6.11}$$

In light of Equations (6.7) and (6.8), if $\cos\xi = -1$, that is, if $\mathbf{d}$ points in the same 
direction as the negative gradient ($-\mathbf{g}$), the objective function $E$ is 
decreased locally by the largest amount at the current point $\boldsymbol{\theta}_{\text{now}}$. This finding 
implies that the negative gradient direction ($-\mathbf{g}$) points to the locally steepest downhill 
direction. From a global perspective, however, going in the negative gradient direction may 
not be a shortcut to the minimum point $\boldsymbol{\theta}^*$ [Figures 6.1 and 6.2]. 

2 Note that Equation (6.10) is just the necessary condition because the gradient $\mathbf{g}$ is zero at any 
stationary point; namely, a maximum, a minimum, or a saddle point. 

Figure 6.3. Newton’s (or Newton-Raphson) method for minimizing a general objective 
function $E$, which is approximated locally as a quadratic form; this approximate 
function is minimized exactly. 

If the steepest descent method employs line minimization in Equation (6.4), 
that is, if the minimum point $\eta^*$ in a direction $\mathbf{d}$ is obtained at each iteration, we 
have 

$$-\frac{dE(\boldsymbol{\theta}_{\text{now}} - \eta\,\mathbf{g}_{\text{now}})}{d\eta} = \nabla^T E(\boldsymbol{\theta}_{\text{now}} - \eta\,\mathbf{g}_{\text{now}})\,\mathbf{g}_{\text{now}} = \mathbf{g}_{\text{next}}^T\,\mathbf{g}_{\text{now}} = 0, \tag{6.12}$$

where $\mathbf{g}_{\text{next}}$ is the gradient vector at the next point. The preceding equation 
indicates that the next gradient vector $\mathbf{g}_{\text{next}}$ is always orthogonal to the current 
gradient vector $\mathbf{g}_{\text{now}}$. Figure 6.2 depicts this situation at point X, where $\mathbf{g}_{\text{next}} = \mathbf{g}_X$. 
Section 6.6 also describes a related situation. For a quadratic objective function, as 
discussed in Section 6.7, the method of steepest descent with line minimization 
generates only two mutually orthogonal directions that are determined by the 
starting point [Figure 6.14(a)]. (Previous investigators note that even for a general 
n-input objective function, the steepest descent method tends to search asymptotically 
merely in some lower-dimensional subspace [1, 12, 30].) Only if the contours of 
the objective function $E$ form hyperspheres (or circles in a two-dimensional space) 
does the steepest descent method lead to the minimum [Figure 6.4(a)] in a single step 
(cf. Theorem 6.1). Otherwise, the method does not necessarily head toward the 
minimum point [Figure 6.4(b)]. 
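
The orthogonality of successive gradients under exact line minimization can be checked numerically. The sketch below assumes the quadratic $E(x, y) = x^2 + 5y^2$ (Hessian `A`) and uses the closed-form step size $\eta^* = \mathbf{g}^T\mathbf{g} / \mathbf{g}^T\mathbf{A}\mathbf{g}$, which holds only for quadratic objectives; the helper names and the starting point are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

A = [[2.0, 0.0], [0.0, 10.0]]        # Hessian of E(x, y) = x^2 + 5 y^2

def gradient(theta):
    return matvec(A, theta)           # g = A theta for this quadratic

def exact_steepest_step(theta):
    """One steepest descent step with exact line minimization (quadratic case)."""
    g = gradient(theta)
    eta = dot(g, g) / dot(g, matvec(A, g))
    return [t - eta * gi for t, gi in zip(theta, g)]

theta_now = [5.0, 1.0]
theta_next = exact_steepest_step(theta_now)
# Inner product of the next and current gradients; Equation (6.12) says it vanishes.
inner = dot(gradient(theta_next), gradient(theta_now))
```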


6.4 NEWTON'S METHODS 
6.4.1 Classical Newton's Method 

The descent direction $\mathbf{d}$ can be determined by using the second derivatives of the 
objective function $E$, if available. For a general continuous objective function, the 
contours may be nearly elliptical in the immediate vicinity of the minimum. If 
the starting position $\boldsymbol{\theta}_{\text{now}}$ is sufficiently close to a local minimum, the objective 
function $E$ is expected to be approximated by a quadratic form: 

$$E(\boldsymbol{\theta}) \approx E(\boldsymbol{\theta}_{\text{now}}) + \mathbf{g}^T(\boldsymbol{\theta} - \boldsymbol{\theta}_{\text{now}}) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_{\text{now}})^T\mathbf{H}(\boldsymbol{\theta} - \boldsymbol{\theta}_{\text{now}}), \tag{6.13}$$

where $\mathbf{H}$ is the Hessian matrix, consisting of the second partial derivatives of $E(\boldsymbol{\theta})$. 
The preceding equation is the Taylor series expansion of $E(\boldsymbol{\theta})$ up to the second-order 
terms. Higher-order terms are omitted due to the assumption that $\|\boldsymbol{\theta} - \boldsymbol{\theta}_{\text{now}}\|$ is 
sufficiently small. 

Since Equation (6.13) is a quadratic function of $\boldsymbol{\theta}$, we can simply find its 
minimum point $\hat{\boldsymbol{\theta}}$ by differentiating Equation (6.13) and setting the result to zero. This 
leads to a set of linear equations: 

$$\mathbf{0} = \mathbf{g} + \mathbf{H}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_{\text{now}}). \tag{6.14}$$

If the inverse of $\mathbf{H}$ exists, we have a unique solution. When the minimum point $\hat{\boldsymbol{\theta}}$ 
of the approximated quadratic function is chosen as the next point $\boldsymbol{\theta}_{\text{next}}$, we have 
the so-called Newton’s method or the Newton-Raphson method: 

$$\hat{\boldsymbol{\theta}} = \boldsymbol{\theta}_{\text{now}} - \mathbf{H}^{-1}\mathbf{g}. \tag{6.15}$$

The step $-\mathbf{H}^{-1}\mathbf{g}$ is called the Newton step, and its direction is called the Newton 
direction. The general gradient-based formula in Equation (6.9) reduces to 
Newton’s method when $\mathbf{G} = \mathbf{H}^{-1}$ and $\eta = 1$. If $\mathbf{H}$ is positive definite and $E(\boldsymbol{\theta})$ 
is quadratic, then Newton’s method reaches a local minimum in a single 
Newton step. If $E(\boldsymbol{\theta})$ is not quadratic, then the minimum may not be reached in a 
single stride, and Newton’s method should be applied repeatedly; Figure 6.3 illustrates 
the progress of repeated application of Newton’s method to a single-variable 
objective function. In this manner, Newton’s method proceeds to the minimum 
point, based on the second-order truncated Taylor series approximation defined in 
Equation (6.13). 
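
The single-step property for a quadratic objective can be illustrated with a small sketch. It assumes a hypothetical two-variable quadratic $E(x, y) = x^2 + 5y^2 + xy - 3x$ and solves the linear system of Equation (6.14) with Cramer’s rule instead of forming $\mathbf{H}^{-1}$ explicitly; the function names are illustrative only.

```python
def solve2(H, g):
    """Solve H x = g for a 2x2 matrix H by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - g[0] * H[1][0]) / det]

def newton_step(theta, grad, hess):
    """theta_next = theta_now - H^{-1} g, as in Equation (6.15)."""
    step = solve2(hess(theta), grad(theta))
    return [t - s for t, s in zip(theta, step)]

# Hypothetical quadratic E(x, y) = x^2 + 5 y^2 + x y - 3 x
grad = lambda th: [2 * th[0] + th[1] - 3, 10 * th[1] + th[0]]
hess = lambda th: [[2.0, 1.0], [1.0, 10.0]]   # constant, positive definite

theta = newton_step([4.0, -7.0], grad, hess)
residual = grad(theta)   # gradient at the new point; zero at the minimum
```

Because the Hessian is constant and positive definite here, the gradient at the new point vanishes (up to rounding) regardless of the starting point.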

Figure 6.4. Comparisons between the steepest descent direction and the Newton 
direction in two-dimensional space: (a) both directions (solid arrow) are the same 
when the contours are circles, and (b) when the contours are elliptical, the Newton 
direction (shaded arrow) points toward the minimum point whereas the steepest 
descent direction (dotted arrow) does not. In any case, the steepest descent direction 
is always perpendicular to the contour line AB at a point S. The Newton direction 
is said to be conjugate to line AB in (b); see Section 6.6.1. 

Figure 6.4 compares the steepest descent direction and the Newton direction 
when the objective functions are quadratic. When the contours are circles in 
Figure 6.4(a), both directions are indicated by the solid arrow. In contrast, when the 
contours are ellipses in Figure 6.4(b), the Newton direction (shaded arrow) points 
directly toward the unique minimum point whereas the steepest descent direction 
(dotted arrow) does not. In any case, the steepest descent direction is always 
perpendicular to the contour line AB at any point S. 

Assume that a linear transformation $\mathbf{T}$ can be introduced: 

$$\boldsymbol{\theta}' = \mathbf{T}\boldsymbol{\theta}, \tag{6.16}$$

such that the elliptical contours can be transformed to circular ones. Consequently, 
the steepest descent direction points toward the unique minimum, and a line search 
can be employed to find the minimum. For instance, when $E(x, y) = x^2 + 5y^2$, it 
can be transformed to $E'(x, z) = x^2 + z^2$, with $\boldsymbol{\theta}' = (x, z)^T$, $\boldsymbol{\theta} = (x, y)^T$, and 

$$\mathbf{T} = \begin{bmatrix} 1 & 0 \\ 0 & \sqrt{5} \end{bmatrix}.$$

In the transformed space, the steepest descent method can be used (cf. Section 6.3). 
When $\mathbf{T}$ is a diagonal matrix as in the foregoing, such a transformation is called 
scaling [cf. Equation (6.78)]. In this sense, the Newton direction is theoretically 
scale invariant [45]. However, finding a successful transformation or scaling is 
difficult in many cases [20]. 

A major disadvantage of Newton’s method is that calculating the inverse of the 
Hessian matrix is computationally intensive and may introduce numerical problems 
due to round-off errors. Section 6.4.3 introduces some alternative routines that 
estimate the Hessian (or its inverse). 


6.4.2 Modified Newton’s Methods* 

The classical Newton’s method defined in Equation (6.15) frequently requires some 
refinements before it is implemented. If the current point $\boldsymbol{\theta}_{\text{now}}$ is remote from a local 
minimum $\boldsymbol{\theta}^*$, the method may not yield a descent direction due to the truncated 
higher-order terms in the Taylor series expansion of the function $E(\cdot)$. 

The following modifications make Newton’s method more robust and reliable. 

Adaptive Step Length 

Even if the Hessian is positive definite, the quadratic approximation may not be 
satisfactory. That is, the direct Newton step with $\eta = 1$ in Equation (6.15) may be 
too long to decrease $E(\boldsymbol{\theta})$. 

A simple modification entails introducing a positive search parameter (or step 
length) $\eta$. Consequently, Equation (6.15) can be modified to 

$$\boldsymbol{\theta}_{\text{next}} = \boldsymbol{\theta}_{\text{now}} - \eta\,\mathbf{H}^{-1}\mathbf{g}, \tag{6.17}$$

where $\eta$ is selected to minimize $E$ [32]. Near the solution, $\eta$ is expected to be close 
to 1 as in Equation (6.15). We can determine $\eta$ to satisfy $E(\boldsymbol{\theta}_{\text{next}}) < E(\boldsymbol{\theta}_{\text{now}})$ in 
a heuristic manner. For instance, 

$$\eta_{k+1} = \frac{1}{2}\eta_k, \qquad k = 0, 1, 2, 3, \ldots$$

where the starting $\eta_0$ can be chosen to be 1.0 or a smaller value. This is the so-called 
step-halving procedure. For the same reason as in Equation (6.8), a procedure 
of this type is guaranteed to find a step that decreases $E$ whenever the Newton 
direction is a descent direction. Section 6.5 provides 
more complex step length determination schemes. 
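
A minimal sketch of step halving for a single-variable objective follows. The example function $E(\theta) = \sqrt{1 + \theta^2}$ is an assumption chosen because its full Newton step overshoots whenever $|\theta| > 1$; the function name, the cap on the number of halvings, and the starting point are likewise hypothetical.

```python
import math

def halving_newton_step(E, dE, d2E, theta, eta0=1.0, max_halvings=30):
    """Shrink eta by 1/2 until E(theta + eta * newton_step) < E(theta)."""
    newton = -dE(theta) / d2E(theta)      # Newton step for a scalar theta
    eta = eta0
    for _ in range(max_halvings):
        candidate = theta + eta * newton
        if E(candidate) < E(theta):
            return candidate, eta
        eta *= 0.5
    return theta, 0.0                     # no acceptable step was found

# Assumed test function: positive second derivative everywhere, minimum at 0.
E = lambda t: math.sqrt(1.0 + t * t)
dE = lambda t: t / math.sqrt(1.0 + t * t)
d2E = lambda t: (1.0 + t * t) ** -1.5

theta_next, eta = halving_newton_step(E, dE, d2E, 2.0)
```

Starting from $\theta = 2$, the full step ($\eta = 1$) and the half step ($\eta = 0.5$) both increase $E$, so the procedure settles on $\eta = 0.25$.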

Levenberg-Marquardt Modifications 

Furthermore, if the Hessian matrix is not positive definite, the Newton direction 
may point toward a local maximum or a saddle point. 

The Hessian can be altered by adding a positive definite matrix $\mathbf{P}$ to $\mathbf{H}$ to make 
the sum positive definite. Levenberg [31] and Marquardt [33] introduced this notion in 
least-squares problems, as described in Section 6.8.2. Later, Goldfeld et al. [22] 
first applied this concept to Newton’s method. When $\mathbf{P} = \lambda\mathbf{I}$, Equation (6.15) 
becomes 

$$\boldsymbol{\theta}_{\text{next}} = \boldsymbol{\theta}_{\text{now}} - (\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{g}, \tag{6.18}$$

where $\mathbf{I}$ is the identity matrix and $\lambda$ is some nonnegative value. Figure 6.5 
depicts this notion. Depending on the magnitude of $\lambda$, the method moves smoothly 
between the two extremes: Newton’s method ($\lambda \to 0$) and the well-known steepest 
descent method ($\lambda \to \infty$). A variety of Levenberg-Marquardt algorithms differ in the 
selection of $\lambda$. Goldfeld et al. computed the eigenvalues of $\mathbf{H}$ and set $\lambda$ to a little larger 
than the magnitude of the most negative eigenvalue. See also Section 6.8.2. 

Moreover, when $\lambda$ increases, $\|\boldsymbol{\theta}_{\text{next}} - \boldsymbol{\theta}_{\text{now}}\|$ decreases (see Theorem 6.2). In 
other words, $\lambda$ plays the same role as an adjustable step length $\eta$ in Equation (6.17). 
That is, with some appropriately large $\lambda$, the inequality (6.3) holds. Of course, the 
step size $\eta$ can be further introduced and can be determined in conjunction with 
line search methods: 

$$\boldsymbol{\theta}_{\text{next}} = \boldsymbol{\theta}_{\text{now}} - \eta\,(\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{g}. \tag{6.19}$$

Figure 6.5. Levenberg-Marquardt step shown as the highlighted arrow. The 
Levenberg-Marquardt method regulates a descent direction based on the Newton 
direction (shaded arrow) and the steepest descent direction (dotted arrow). The 
parameter $\lambda$ also controls the magnitude of the step $\|\boldsymbol{\theta}_{\text{next}} - \boldsymbol{\theta}_{\text{now}}\|$ along the dotted curve. 
When $\lambda = 0$, the step is the Newton step. 

Figure 6.6. (a) A quadratic surface $E(\boldsymbol{\theta}) = E(x, y) = x^2 - y^2$; (b) its gradient 
vector field and contour lines; (c) two Levenberg-Marquardt directions, LM(0.5) 
and LM(2.0), the Newton direction, and the steepest descent (SD) direction. These 
directions are normalized for display purposes. (MATLAB file: descent.m) 

Now we show a trivial example of Levenberg-Marquardt directions. Figure 6.6 
illustrates the hyperbolic paraboloid defined by $E(\boldsymbol{\theta}) = E(x, y) = x^2 - y^2$. Its 
Hessian matrix is indefinite; the Newton direction $(-x, -y)^T$ always points 
toward the saddle point (0, 0). Figure 6.6(c) shows two representative 
Levenberg-Marquardt directions: LM(0.5) when $\lambda = 0.5$ and LM(2.0) when $\lambda = 2.0$. (Notice 
that those directions are normalized for display purposes.) This shows how the 
Levenberg-Marquardt modification guides a downhill direction in the vicinity of the 
saddle point. Gill et al. [20] discussed how to deal with such an indefinite Hessian 
based on Cholesky factorization to decide a better descent direction. 
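
The effect of the modification near the saddle point can be checked with a short sketch based on Equation (6.18). It exploits the fact that the Hessian of $E = x^2 - y^2$ is diagonal, so no matrix inversion is needed. The value $\lambda = 3$ is an assumption made here (a little larger than 2, the magnitude of the most negative eigenvalue, in the spirit of Goldfeld et al.); it is not one of the values plotted in Figure 6.6, and the helper names are hypothetical.

```python
def lm_direction(g, h_diag, lam):
    """-(H + lam * I)^{-1} g for a diagonal Hessian given by its diagonal entries."""
    return [-gi / (hi + lam) for gi, hi in zip(g, h_diag)]

x, y = 1.0, 1.0
g = [2 * x, -2 * y]            # gradient of E(x, y) = x^2 - y^2 at (1, 1)
h_diag = [2.0, -2.0]           # indefinite Hessian diag(2, -2)

def slope(d):
    """Directional derivative g^T d; negative means downhill."""
    return g[0] * d[0] + g[1] * d[1]

newton = lm_direction(g, h_diag, 0.0)   # lam = 0 gives the Newton direction (-x, -y)
lm = lm_direction(g, h_diag, 3.0)       # assumed lam = 3 makes H + lam*I positive definite
```

At (1, 1) the Newton direction has zero slope (it heads for the saddle point), whereas the modified direction has a strictly negative slope.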
Trust-region Methods 

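Recall the second-order truncated Taylor series approximation to the objective 
function $E$ in Equation (6.13): 

$$E(\boldsymbol{\theta}_{\text{now}} + \boldsymbol{\delta}) \approx E(\boldsymbol{\theta}_{\text{now}}) + \mathbf{g}^T\boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^T\mathbf{H}\boldsymbol{\delta} = q(\boldsymbol{\delta}), \tag{6.20}$$

where $\boldsymbol{\delta} = \boldsymbol{\theta}_{\text{next}} - \boldsymbol{\theta}_{\text{now}}$. The effectiveness of this quadratic approximation can be 
measured by $\nu_{\text{now}}$, which is defined as 

$$\nu_{\text{now}} = \frac{E(\boldsymbol{\theta}_{\text{now}}) - E(\boldsymbol{\theta}_{\text{now}} + \boldsymbol{\delta})}{E(\boldsymbol{\theta}_{\text{now}}) - q(\boldsymbol{\delta})}.$$

In conjunction with the same formula as Equation (6.18), trust-region (or 
restricted step) methods attempt to minimize $q(\boldsymbol{\delta})$ only in a designated area, 
called a trust region, which may be defined as 

$$\{\boldsymbol{\delta} : \|\boldsymbol{\delta}\| \le R_{\text{now}}\} \qquad (R_{\text{now}} > 0).$$

The value of $R_{\text{now}}$ determines the size of the trust region, according to rules of 
thumb [15]: 

If $\nu_{\text{now}}$ is small (e.g., $\nu_{\text{now}} < 0.2$), decrease $R_{\text{now}}$. 

If $\nu_{\text{now}}$ is large (e.g., $\nu_{\text{now}} > 0.8$), increase $R_{\text{now}}$. 

Otherwise, keep the same size of $R_{\text{now}}$. 

This method can be implemented even when the Hessian matrix is indefinite or 
singular. Moreover, this regional search in conjunction with Equation (6.18) can be 
considered as an extension of the Levenberg-Marquardt idea [20]. 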
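
The ratio $\nu_{\text{now}}$ and the rules of thumb above can be sketched as follows. The text does not say by how much the radius should change, so the factor of 2 is an assumption, as are the function names and the one-variable example $E(\theta) = \theta^4$ expanded about $\theta_{\text{now}} = 1$.

```python
def trust_ratio(E, q, theta_now, delta):
    """nu = actual reduction of E / reduction predicted by the quadratic model q."""
    theta_trial = [t + d for t, d in zip(theta_now, delta)]
    actual = E(theta_now) - E(theta_trial)
    predicted = E(theta_now) - q(delta)
    return actual / predicted

def update_radius(radius, nu, low=0.2, high=0.8, factor=2.0):
    """Rules of thumb: shrink on poor agreement, grow on good agreement."""
    if nu < low:
        return radius / factor
    if nu > high:
        return radius * factor
    return radius

# Assumed example: E(theta) = theta^4 at theta_now = 1, where g = 4 and H = 12,
# so q(delta) = 1 + 4*delta + 6*delta^2 as in Equation (6.20).
E = lambda th: th[0] ** 4
q = lambda d: 1.0 + 4.0 * d[0] + 6.0 * d[0] ** 2

nu = trust_ratio(E, q, [1.0], [-0.25])
new_radius = update_radius(0.25, nu)
```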


6.4.3 Quasi-Newton Methods* 

Differentiating Equation (6.13) yields 

$$\mathbf{H}_k(\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k) = \mathbf{g}_{k+1} - \mathbf{g}_k, \tag{6.22}$$

where $k$ denotes the current iteration number. This formula intuitively indicates 
that the Hessian $\mathbf{H}$ can be interpreted as the rate of change of the gradients between 
$\mathbf{g}(\boldsymbol{\theta}_{\text{now}})$ and $\mathbf{g}(\boldsymbol{\theta}_{\text{next}})$. Thus, $\mathbf{H}$ can be approximated on the basis of 
$\Delta\mathbf{g}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$ and $\Delta\boldsymbol{\theta}_k = \boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k$. This is the concept of quasi-Newton 
methods, also known as variable metric methods. Based on such derivative 
information, the quasi-Newton methods attempt to construct gradually, as iterations 
progress, an approximation $\mathbf{M}$ to 

• the Hessian matrix $\mathbf{H}$ (e.g., the Gill-Murray method [19, 20]), or 

• the inverse of the Hessian matrix, $\mathbf{H}^{-1}$. 

We discuss the second scheme below. 

The approximations $\mathbf{M}$ ideally converge to $\mathbf{H}^{-1}$ near the solution point: 

$$\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k \approx \mathbf{M}_k(\mathbf{g}_{k+1} - \mathbf{g}_k).$$

However, $\mathbf{M}_k$ determines $\Delta\boldsymbol{\theta}_k$; hence $\mathbf{M}_{k+1}$ is used to satisfy the quasi-Newton 
condition (or secant condition) given by 

$$\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k = \mathbf{M}_{k+1}(\mathbf{g}_{k+1} - \mathbf{g}_k) \tag{6.23}$$

or 

$$\Delta\boldsymbol{\theta}_k = \mathbf{M}_{k+1}\Delta\mathbf{g}_k.$$

The initial $\mathbf{M}_0$ is often chosen as $\mathbf{I}$. The key issue is how to update $\mathbf{M}$ at each iteration. 
Two widely used updating schemes are the Davidon-Fletcher-Powell (DFP) [16] 
and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) [7, 8, 13, 21, 46] updating 
formulas. 


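DFP formula 

$$\mathbf{M}_{k+1} = \mathbf{M}_k + \frac{\Delta\boldsymbol{\theta}_k\Delta\boldsymbol{\theta}_k^T}{\Delta\boldsymbol{\theta}_k^T\Delta\mathbf{g}_k} - \frac{\mathbf{M}_k\Delta\mathbf{g}_k\Delta\mathbf{g}_k^T\mathbf{M}_k}{\Delta\mathbf{g}_k^T\mathbf{M}_k\Delta\mathbf{g}_k} \tag{6.24}$$

BFGS formula 

$$\begin{aligned}
\mathbf{M}_{k+1} &= \mathbf{M}_k + \left(1 + \frac{\Delta\mathbf{g}_k^T\mathbf{M}_k\Delta\mathbf{g}_k}{\Delta\boldsymbol{\theta}_k^T\Delta\mathbf{g}_k}\right)\frac{\Delta\boldsymbol{\theta}_k\Delta\boldsymbol{\theta}_k^T}{\Delta\boldsymbol{\theta}_k^T\Delta\mathbf{g}_k} - \frac{\Delta\boldsymbol{\theta}_k\Delta\mathbf{g}_k^T\mathbf{M}_k + \mathbf{M}_k\Delta\mathbf{g}_k\Delta\boldsymbol{\theta}_k^T}{\Delta\boldsymbol{\theta}_k^T\Delta\mathbf{g}_k} \\
&= \left(\mathbf{I} - \frac{\Delta\boldsymbol{\theta}_k\Delta\mathbf{g}_k^T}{\Delta\boldsymbol{\theta}_k^T\Delta\mathbf{g}_k}\right)\mathbf{M}_k\left(\mathbf{I} - \frac{\Delta\mathbf{g}_k\Delta\boldsymbol{\theta}_k^T}{\Delta\boldsymbol{\theta}_k^T\Delta\mathbf{g}_k}\right) + \frac{\Delta\boldsymbol{\theta}_k\Delta\boldsymbol{\theta}_k^T}{\Delta\boldsymbol{\theta}_k^T\Delta\mathbf{g}_k}
\end{aligned} \tag{6.25}$$

With the foregoing updating procedures, if 

$$\Delta\boldsymbol{\theta}_k^T\Delta\mathbf{g}_k > 0 \tag{6.26}$$

and $\mathbf{M}_k$ is positive definite, then the subsequent $\mathbf{M}_{k+1}$ is positive definite, which 
is called hereditary positive definiteness. Due to Equation (6.2), $\Delta\boldsymbol{\theta}_k = \eta_k\mathbf{d}_k$. 
Hence, the preceding inequality is equivalent to 

$$\mathbf{g}_{k+1}^T\mathbf{d}_k > \mathbf{g}_k^T\mathbf{d}_k. \tag{6.27}$$

This condition will hold if $\mathbf{d}_k$ is chosen to be a descent direction with an appropriate 
step size $\eta_k$, as illustrated in Figure 6.2. Therefore, if such a suitable step size is 
determined, both conditions (6.3) and (6.26) will be satisfied. 

BFGS is generally known to be more tolerant of inaccuracy of line minimization 
than DFP [40]. Moreover, DFP and BFGS are mutually complementary (or dual) 
formulas; that is, when the methods attempt to develop an approximation $\mathbf{B}$ to the 
Hessian matrix $\mathbf{H}$, DFP has the same form as Equation (6.25) except that $\mathbf{M}$ is 
replaced with $\mathbf{B}$ and $\Delta\boldsymbol{\theta}_k$ and $\Delta\mathbf{g}_k$ are interchanged. Similarly, BFGS has the same 
form as Equation (6.24). 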
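
Both updating formulas can be written in a few lines and checked against the secant condition (6.23). The sketch below is an illustration only: it assumes $\mathbf{M}_k$ is symmetric (so that $\mathbf{M}_k\Delta\mathbf{g}_k\Delta\mathbf{g}_k^T\mathbf{M}_k$ is the outer product of $\mathbf{M}_k\Delta\mathbf{g}_k$ with itself), uses plain nested lists, and the sample vectors are hypothetical values chosen so that condition (6.26) holds.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def dfp_update(M, dth, dg):
    """DFP update of the inverse-Hessian approximation, Equation (6.24)."""
    Mdg = matvec(M, dg)
    a = dot(dth, dg)            # must be positive, condition (6.26)
    b = dot(dg, Mdg)
    n = len(M)
    return [[M[i][j] + dth[i] * dth[j] / a - Mdg[i] * Mdg[j] / b
             for j in range(n)] for i in range(n)]

def bfgs_update(M, dth, dg):
    """BFGS update of the inverse-Hessian approximation, Equation (6.25)."""
    Mdg = matvec(M, dg)
    a = dot(dth, dg)            # must be positive, condition (6.26)
    c = 1.0 + dot(dg, Mdg) / a
    n = len(M)
    return [[M[i][j] + c * dth[i] * dth[j] / a
             - (dth[i] * Mdg[j] + Mdg[i] * dth[j]) / a
             for j in range(n)] for i in range(n)]

M0 = [[1.0, 0.0], [0.0, 1.0]]          # M_0 = I
dth, dg = [1.0, 2.0], [3.0, 1.0]       # hypothetical steps with dth . dg = 5 > 0
M_dfp = dfp_update(M0, dth, dg)
M_bfgs = bfgs_update(M0, dth, dg)
```

For either update, multiplying the new matrix by $\Delta\mathbf{g}_k$ reproduces $\Delta\boldsymbol{\theta}_k$, which is exactly the quasi-Newton condition.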
6.5 STEP SIZE DETERMINATION 

Recall the formula of a class of gradient-based descent methods given by 
Equation (6.9): 

$$\boldsymbol{\theta}_{\text{next}} = \boldsymbol{\theta}_{\text{now}} + \eta\,\mathbf{d} = \boldsymbol{\theta}_{\text{now}} - \eta\,\mathbf{G}\mathbf{g}.$$

This formula entails effectively determining the step size $\eta$. The efficiency of the step 
size determination affects the entire minimization process. For a general function 
$E$, analytically solving Equation (6.4) as in 

$$\phi'(\eta) = 0, \quad \text{where } \phi(\eta) = E(\boldsymbol{\theta}_{\text{now}} + \eta\,\mathbf{d}),$$

is often impossible. That is, the univariate function $\phi(\eta)$ should be minimized 
on the line determined by the current point $\boldsymbol{\theta}_{\text{now}}$ and the direction $\mathbf{d}$. This is 
accomplished by line search (or one-dimensional search) methods. 

In the rest of this section, we discuss the line minimization methods and their 
stopping criteria to prevent greedy search schemes from slowing down the entire 
minimization algorithm. 

6.5.1 Initial Bracketing 

The line search methods discussed in subsequent sections basically assume that 
the search area, or the specified interval, contains a single relative minimum; that 
is, the function E is unimodal over the closed interval. Determining the initial 
interval in which a relative minimum must lie is of critical importance. To begin 
with line searches, some routine must be employed for initially bracketing an 
assumed minimum into the starting interval. This kind of procedure can be roughly 
categorized into two schemes [41]: 

1. A scheme, by function evaluations, for finding three points to satisfy 

$$E(\theta_{k-1}) > E(\theta_k) < E(\theta_{k+1}), \qquad \theta_{k-1} < \theta_k < \theta_{k+1}.$$

2. A scheme, by taking the first derivatives, for finding two points to satisfy 

$$E'(\theta_k) < 0, \quad E'(\theta_{k+1}) > 0, \qquad \theta_k < \theta_{k+1}.$$


For scheme 1, the common algorithm can be outlined as follows: 

Algorithm 6.1 An initial bracketing procedure for searching three points $\theta_1$, $\theta_2$, 
and $\theta_3$ 

(1) Given a starting point $\theta_0$ and $h \in R$, let $\theta_1$ be $\theta_0 + h$. Evaluate $E(\theta_1)$. 
    If $E(\theta_0) > E(\theta_1)$ (i.e., go downhill), set $i \leftarrow 1$ and go to (2). 
    Otherwise (i.e., go uphill), set $h \leftarrow -h$ (i.e., set backward direction), 
    $E(\theta_{-1}) \leftarrow E(\theta_1)$, $\theta_1 \leftarrow \theta_0 + h$, $i \leftarrow 0$, and go to (3). 

(2) Set the next point by: $h \leftarrow 2h$, $\theta_{i+1} \leftarrow \theta_i + h$. 

(3) Evaluate $E(\theta_{i+1})$. 
    If $E(\theta_i) > E(\theta_{i+1})$ (i.e., still go downhill), set $i \leftarrow i + 1$ and go to (2). 
    Otherwise, arrange $\theta_{i-1}$, $\theta_i$, and $\theta_{i+1}$ in decreasing order. 
    Then we obtain the three points $(\theta_1, \theta_2, \theta_3)$. Stop. 

□ 

The algorithm based on scheme 2 is left to the reader (Exercise 4). The following 
sections of line search methods assume that initial bracketing procedures of these 
types can adequately find several starting points. 
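
A bracketing routine in the spirit of Algorithm 6.1 (scheme 1) might look like the sketch below. It is not a line-by-line transcription: the reversal step is handled by swapping the first two points, the result is returned in increasing order, and a cap on the number of doublings is added so that the loop terminates on functions without a minimum. The function name and the test objective are assumptions.

```python
def bracket_minimum(E, theta0, h=0.1, max_doublings=60):
    """Return (a, b, c) with a < b < c and E(b) no larger than E(a) and E(c)."""
    prev, cur = theta0, theta0 + h
    if E(cur) > E(prev):              # first step went uphill: search backward
        prev, cur = cur, prev
        h = -h
    for _ in range(max_doublings):
        h *= 2.0                      # double the step at every trial
        nxt = cur + h
        if E(nxt) > E(cur):           # function rose again: minimum is bracketed
            return tuple(sorted((prev, cur, nxt)))
        prev, cur = cur, nxt          # still going downhill
    raise RuntimeError("no bracket found within max_doublings")

# Assumed test objective with its minimum at theta = 3.
E = lambda t: (t - 3.0) ** 2
left_start = bracket_minimum(E, 0.0)     # starts to the left of the minimum
right_start = bracket_minimum(E, 10.0)   # first trial step goes uphill
```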

6.5.2 Line Searches 

The process of determining $\eta^*$ that minimizes a one-dimensional function $\phi(\eta)$ is 
achieved by searching on the line for the minimum. The method of line searches 
(or one-dimensional searches) is important [5, 32, 44, 49] because higher-dimensional 
problems are ultimately solved by repeating line searches. Also, line search 
algorithms usually include two components: sectioning (or bracketing), and 
polynomial interpolation. 

Newton’s Method 

When $\phi(\eta_k)$, $\phi'(\eta_k)$, and $\phi''(\eta_k)$ are available, the classical Newton method in 
Equation (6.15) can be applied to solving the equation $\phi'(\eta_k) = 0$: 

$$\eta_{k+1} = \eta_k - \frac{\phi'(\eta_k)}{\phi''(\eta_k)}. \tag{6.28}$$

Figure 6.7(a) shows how the preceding formula determines the next step size $\eta_{k+1}$. 

Figure 6.7. Newton’s method (left) and the secant method (right) to determine the 
step size. In panel (a), $q$ denotes the approximated quadratic function. 


Secant Method 

If we use both $\eta_k$ and $\eta_{k-1}$ to approximate the second derivative in Equation (6.28), 
and if the first derivatives alone are available, then we have an estimated $\eta_{k+1}$: 

$$\eta_{k+1} = \eta_k - \frac{\phi'(\eta_k)}{\dfrac{\phi'(\eta_k) - \phi'(\eta_{k-1})}{\eta_k - \eta_{k-1}}}. \tag{6.29}$$

This is the so-called method of false position or the secant method, as 
illustrated in Figure 6.7(b). (Also refer to Section 6.4.3.) 
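
Equations (6.28) and (6.29) translate directly into short iterations. The sketch below assumes a hypothetical one-dimensional function $\phi(\eta) = e^{\eta} - 2\eta$, whose minimum lies at $\eta^* = \ln 2$; the function names, iteration counts, and tolerance are illustrative choices rather than part of the text.

```python
import math

def newton_1d(dphi, d2phi, eta, iters=20):
    """Iterate eta <- eta - phi'(eta) / phi''(eta), Equation (6.28)."""
    for _ in range(iters):
        eta = eta - dphi(eta) / d2phi(eta)
    return eta

def secant_1d(dphi, eta_prev, eta, tol=1e-10, max_iters=50):
    """Equation (6.29): phi'' is replaced by a finite difference of phi'."""
    for _ in range(max_iters):
        slope = (dphi(eta) - dphi(eta_prev)) / (eta - eta_prev)
        eta_prev, eta = eta, eta - dphi(eta) / slope
        if abs(eta - eta_prev) < tol:     # stop before the difference degenerates
            break
    return eta

# Assumed example: phi(eta) = exp(eta) - 2*eta, so phi' = exp(eta) - 2.
dphi = lambda e: math.exp(e) - 2.0
d2phi = lambda e: math.exp(e)

eta_newton = newton_1d(dphi, d2phi, 0.0)
eta_secant = secant_1d(dphi, 0.0, 1.0)
```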


Sectioning methods 

A sectioning algorithm begins with an interval $[a_1, b_1]$ in which the minimum $\eta^*$ 
must lie, and then reduces the length of the interval at each iteration by evaluating 
the value of $\phi$ at a certain number of points. The two endpoints $a_1$ and $b_1$ can be 
found by the initial bracketing described in Section 6.5.1. 

The bisection method is one of the simplest sectioning methods for solving 
$\phi'(\eta) = 0$, if first derivatives are available. Let $\phi'(\eta)$ be denoted by $\varphi(\eta)$ for simplicity. The 
algorithm is shown next. 

Algorithm 6.2 Bisection method 

(1) Given a small value $\epsilon \in R$ and an initial interval with two endpoints $a_1$ and 
    $a_2$ such that $a_1 < a_2$ and $\varphi(a_1)\varphi(a_2) < 0$: 
    $\eta_{\text{left}} \leftarrow a_1$, 
    $\eta_{\text{right}} \leftarrow a_2$. 

(2) Calculate the midpoint $\eta_{\text{mid}}$; that is, $\eta_{\text{mid}} \leftarrow (\eta_{\text{right}} + \eta_{\text{left}})/2$. 
    If $\varphi(\eta_{\text{right}})\varphi(\eta_{\text{mid}}) < 0$, $\eta_{\text{left}} \leftarrow \eta_{\text{mid}}$. 
    Otherwise, $\eta_{\text{right}} \leftarrow \eta_{\text{mid}}$. 

(3) Check if $|\eta_{\text{left}} - \eta_{\text{right}}| < \epsilon$. If it holds, terminate the algorithm. Otherwise, 
    go to (2). 

□ 

The bisection method replaces the right or left endpoint by the interval’s midpoint 
based on the function evaluation at the midpoint. The length of the bracketing 
interval is halved at each iteration. 

Figure 6.8. Golden section search to determine the step length. 

Next, we describe the golden section search method, which efficiently reduces 
the interval length based on function evaluations alone. The golden section search 
does not require $\phi$ to be continuous or differentiable. 

Given an initial interval $[a_1, b_1]$ ($\ni \eta^*$), the next trial points $(s_k, t_k)$ within the 
interval are determined by using the golden section ratio $\tau$: 

$$s_k = b_k - \frac{1}{\tau}(b_k - a_k) = a_k + \left(1 - \frac{1}{\tau}\right)(b_k - a_k),$$

$$t_k = a_k + \frac{1}{\tau}(b_k - a_k),$$

where $\tau = \frac{1 + \sqrt{5}}{2} \approx 1.618$. This procedure guarantees that $a_k < s_k < t_k < b_k$, as 
shown in Figure 6.8. 

The algorithm generates a sequence of the two ends, $a_k$ and $b_k$, according to 

If $\phi(s_k) \ge \phi(t_k)$, $a_{k+1} = s_k$, $b_{k+1} = b_k$. 

Otherwise, $a_{k+1} = a_k$, $b_{k+1} = t_k$. 

The minimum point $\eta^*$ is bracketed to an interval just $\frac{1}{\tau}$ ($\approx 0.618$, i.e., approximately 
two-thirds) times the length of the preceding interval. After the $k$th iteration, 
the length of the bracketing interval shrinks to $(b_1 - a_1)\left(\frac{1}{\tau}\right)^{k-1}$. 
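
The two sectioning methods can be sketched as follows. The bisection routine follows Algorithm 6.2 with one added guard (an early return if the midpoint is an exact root of $\phi'$). The golden section routine recomputes both trial points at every iteration for clarity, whereas practical implementations usually reuse one of the two function values. The function names, tolerances, and the test function $\phi(\eta) = (\eta - 2)^2$ are assumptions.

```python
import math

def bisection(dphi, left, right, eps=1e-6):
    """Algorithm 6.2 applied to dphi; assumes dphi(left) * dphi(right) < 0."""
    while abs(right - left) >= eps:
        mid = 0.5 * (left + right)
        if dphi(mid) == 0.0:              # added guard: exact stationary point
            return mid
        if dphi(right) * dphi(mid) < 0:
            left = mid                    # sign change lies in [mid, right]
        else:
            right = mid                   # sign change lies in [left, mid]
    return 0.5 * (left + right)

def golden_section(phi, a, b, eps=1e-6):
    """Shrink [a, b] by the factor 1/tau per iteration using values of phi only."""
    inv_tau = 2.0 / (1.0 + math.sqrt(5.0))    # 1 / tau, about 0.618
    while b - a >= eps:
        s = b - inv_tau * (b - a)
        t = a + inv_tau * (b - a)
        if phi(s) >= phi(t):
            a = s                         # minimum lies in [s, b]
        else:
            b = t                         # minimum lies in [a, t]
    return 0.5 * (a + b)

phi = lambda e: (e - 2.0) ** 2            # assumed unimodal test function
dphi = lambda e: 2.0 * (e - 2.0)

eta_bisect = bisection(dphi, 0.0, 5.0)
eta_golden = golden_section(phi, 0.0, 5.0)
```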


Polynomial Interpolation 

Polynomial interpolation methods are based on curve-fitting procedures, which work 
well when the objective function possesses a certain degree of smoothness. 

A quadratic interpolation method constructs a smooth quadratic curve $q$ 
that passes through three evaluated points, $(\eta_1, \phi_1)$, $(\eta_2, \phi_2)$, and $(\eta_3, \phi_3)$: 

$$q(\eta) = \sum_{i=1}^{3} \phi_i\,\frac{\prod_{j \ne i}(\eta - \eta_j)}{\prod_{j \ne i}(\eta_i - \eta_j)}, \tag{6.30}$$

where $\phi_i = \phi(\eta_i)$, $i = 1, 2, 3$. 
The quadratic function has a unique minimum point, which can be easily determined 
by solving $q'(\eta) = 0$. Hence, the next trial point $\eta_{\text{next}}$ is given by 

$$\eta_{\text{next}} = \frac{1}{2}\,\frac{(\eta_2^2 - \eta_3^2)\phi_1 + (\eta_3^2 - \eta_1^2)\phi_2 + (\eta_1^2 - \eta_2^2)\phi_3}{(\eta_2 - \eta_3)\phi_1 + (\eta_3 - \eta_1)\phi_2 + (\eta_1 - \eta_2)\phi_3}. \tag{6.31}$$

(See Figure 6.9.) 
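
Equation (6.31) can be evaluated directly, as in the sketch below. When the sampled function is itself quadratic, a single interpolation step lands on its minimum; the function name and the sample points (taken from the assumed function $\phi(\eta) = (\eta - 1.5)^2 + 2$) are hypothetical.

```python
def quadratic_interpolation_step(etas, phis):
    """Minimum of the parabola through three points, Equation (6.31)."""
    e1, e2, e3 = etas
    p1, p2, p3 = phis
    num = (e2**2 - e3**2) * p1 + (e3**2 - e1**2) * p2 + (e1**2 - e2**2) * p3
    den = (e2 - e3) * p1 + (e3 - e1) * p2 + (e1 - e2) * p3
    return 0.5 * num / den

# Samples of the assumed function phi(eta) = (eta - 1.5)^2 + 2 at eta = 0, 1, 3.
phi = lambda e: (e - 1.5) ** 2 + 2.0
etas = (0.0, 1.0, 3.0)
eta_next = quadratic_interpolation_step(etas, tuple(phi(e) for e in etas))
```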

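When four values $\phi(\eta_1)$, $\phi'(\eta_1)$, $\phi(\eta_2)$, and $\phi'(\eta_2)$ are available, a cubic 
interpolation method can not only construct a cubic equation, but can also determine the 
next point $\eta_{\text{next}}$ as its relative minimum point: 

$$\eta_{\text{next}} = \eta_2 - (\eta_2 - \eta_1)\,\frac{\phi'(\eta_2) + \gamma - \beta}{\phi'(\eta_2) - \phi'(\eta_1) + 2\gamma}, \tag{6.32}$$

where 

$$\beta = \phi'(\eta_1) + \phi'(\eta_2) - 3\,\frac{\phi(\eta_1) - \phi(\eta_2)}{\eta_1 - \eta_2}, \quad \text{and} \quad \gamma = \sqrt{\beta^2 - \phi'(\eta_1)\phi'(\eta_2)}.$$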

6.5.3 Termination Rules* 

In practice, it is nearly impossible to obtain the exact minimum point of the function 
$\phi$ by the aforementioned methods of line searches. Therefore, giving up exhaustive 
line searches at the expense of accuracy is normally desired to accelerate the entire 
minimization process. That is, a reasonable stopping criterion must be established 
to terminate the search procedures before they have converged. In the following, 
we describe some stopping rules widely applicable to line search methods. 


Goldstein Test 

Because $\phi(\eta) = E(\boldsymbol{\theta}_{\text{now}} + \eta\,\mathbf{d})$ ($= E(\boldsymbol{\theta}_{\text{next}})$), Equation (6.3) can be rewritten as 

$$E(\boldsymbol{\theta}_{\text{next}}) = \phi(\eta) < \phi(0) = E(\boldsymbol{\theta}_{\text{now}}). \tag{6.33}$$

A value of $\eta$ is considered to be not too large if, with a given $\mu$ ($0 < \mu < \frac{1}{2}$), 

$$\phi(\eta) \le \phi(0) + \mu\,\phi'(0)\,\eta. \tag{6.34}$$

Due to the feasible descent direction condition (6.7), 

$$\phi'(0) = \mathbf{g}^T\mathbf{d} < 0,$$

we can obtain, from the inequality (6.34), 

$$\phi(\eta) \le \phi(0) + \mu\,\phi'(0)\,\eta < \phi(0),$$

where $\mu$ and $\eta$ are positive. That is, Equation (6.34) automatically guarantees that 
the direction is downhill, as required by (6.33). In addition, a value of $\eta$ is considered to be not too small 
if 

$$\phi(\eta) > \phi(0) + (1 - \mu)\,\phi'(0)\,\eta. \tag{6.35}$$

Figure 6.10. Goldstein test. 

From the preceding two inequalities, we have 

$$(1 - \mu)\,\phi'(0)\,\eta < \phi(\eta) - \phi(0)\ \left(= E(\boldsymbol{\theta}_{\text{next}}) - E(\boldsymbol{\theta}_{\text{now}})\right) \le \mu\,\phi'(0)\,\eta.$$

This can be rewritten as 

$$0 < \mu \le \frac{E(\boldsymbol{\theta}_{\text{next}}) - E(\boldsymbol{\theta}_{\text{now}})}{\eta\,\mathbf{g}^T\mathbf{d}} < 1 - \mu < 1.$$

The Goldstein test requires that $\eta$ satisfy the preceding condition. Geometrically 
speaking, $\phi(\eta)$ must lie in the shaded region between the two lines in Figure 6.10. 
With a smaller $\mu$, the acceptable range of $\eta$ becomes wider [23]. 

Wolfe Test 

When the starting $\eta$ is chosen to satisfy Equation (6.34) with a given $\mu$ ($0 < \mu < \frac{1}{2}$), 
the Wolfe test [50] calls for the following condition to ensure that $\eta$ is not too small: 

$$\phi'(\eta) \ge (1 - \mu)\,\phi'(0). \tag{6.36}$$

(See Exercise 5.) 
Armijo Test 

The Armijo test [2] uses the following rule to guarantee that $\eta$ is considered to be 
not too small: 

$$\phi(\xi\eta) > \phi(0) + \mu\,\phi'(0)\,\xi\eta \qquad (0 < \mu < 1), \tag{6.37}$$

where a value of $\xi > 1$ should be selected. Starting with an arbitrary $\eta$, if 
Equation (6.34) holds with a given $\mu$ ($0 < \mu < 1$), $\eta$ is multiplied by $\xi$ repeatedly until the rule (6.34) does 
not hold for the enlarged value. In contrast, if Equation (6.34) does not hold, $\eta$ is divided by $\xi$ until the 
decreased $\eta$ satisfies Equation (6.34). 
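
The termination rules can be sketched as small predicates on $\phi$. The code below is an illustration under assumptions: $\mu = 0.25$ and $\xi = 2$ are arbitrary admissible choices, the iteration cap is added for safety, the function names are hypothetical, and the example $\phi(\eta) = (\eta - 1)^2$ has $\phi(0) = 1$ and $\phi'(0) = -2$.

```python
def not_too_large(phi, dphi0, eta, mu):
    """Inequality (6.34): phi(eta) <= phi(0) + mu * phi'(0) * eta."""
    return phi(eta) <= phi(0.0) + mu * dphi0 * eta

def goldstein_ok(phi, dphi0, eta, mu=0.25):
    """Goldstein test: (6.34) together with the lower bound (6.35)."""
    not_too_small = phi(eta) > phi(0.0) + (1.0 - mu) * dphi0 * eta
    return not_too_large(phi, dphi0, eta, mu) and not_too_small

def armijo_step(phi, dphi0, eta=1.0, mu=0.25, xi=2.0, max_iters=60):
    """Grow or shrink eta by the factor xi so that (6.34) and (6.37) both hold."""
    if not_too_large(phi, dphi0, eta, mu):
        for _ in range(max_iters):        # enlarge while xi * eta is still acceptable
            if not not_too_large(phi, dphi0, xi * eta, mu):
                break
            eta *= xi
    else:
        for _ in range(max_iters):        # shrink until (6.34) is satisfied
            eta /= xi
            if not_too_large(phi, dphi0, eta, mu):
                break
    return eta

phi = lambda e: (e - 1.0) ** 2            # assumed example, phi'(0) = -2
eta_from_large = armijo_step(phi, -2.0, eta=8.0)
eta_from_small = armijo_step(phi, -2.0, eta=0.1)
```

For this example inequality (6.34) holds exactly when $\eta \le 1.5$, so a start at 8 is halved down to 1, while a start at 0.1 is doubled up to 0.8.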

6.6 CONJUGATE GRADIENT METHODS* 

In this section, we describe conjugate gradient methods, which were originally 
introduced as iterative algorithms for solving linear systems [27]. In the 1960s, the 
methods became widely used multivariate optimization algorithms because they can 
retain the power of second-order methods without calculating or storing 
second-derivative information. They are significantly less expensive than (quasi-) Newton 
methods, thereby making them useful for very large problems. 

First, we show that conjugacy is a generalized concept of orthogonality. Then 
we derive conjugate gradient algorithms. 

6.6.1 Conjugate Directions* 

Conjugate direction methods are also based on the second-order approximation 
of the objective function $E$ defined in Equation (6.13). As revealed in Figure 6.4, 
the Newton direction is trustworthy if the objective function $E$ has an approximately 
quadratic form. Although the steepest descent direction [the dotted arrow 
in Figure 6.4(b)] does not necessarily point toward a minimum point, the Newton 
direction (the shaded arrow) does. Such a Newton direction can be considered to 
be conjugate to the contour line AB at a point S, whereas the steepest descent direction 
is orthogonal to the contour line AB in Figure 6.4. 

Generally, given a symmetric ($n \times n$) matrix $\mathbf{Q}$, two n-dimensional (direction) 
vectors $\mathbf{d}_j$ and $\mathbf{d}_k$ are mutually conjugate with respect to $\mathbf{Q}$, or Q-orthogonal, 
if the following equation holds: 

$$\mathbf{d}_j^T\mathbf{Q}\,\mathbf{d}_k = 0. \tag{6.38}$$

In particular, if $\mathbf{Q} = \mathbf{I}$, then $\mathbf{d}_j$ and $\mathbf{d}_k$ are mutually orthogonal [see Equation (6.12)]. 

Lemma 6.1 If $\mathbf{Q}$ is positive definite, $n$ mutually conjugate (nonzero) vectors $\mathbf{d}_k$ 
are linearly independent. 

Proof: Assume that the nonzero vectors $\mathbf{d}_k$ ($k = 1, 2, \ldots, n$) are linearly 
dependent. Then there exist scalars $c_k$, not all zero, such that $\sum_{k=1}^{n} c_k\mathbf{d}_k = \mathbf{0}$. Premultiplying 
by $\mathbf{d}_j^T\mathbf{Q}$ and using Equation (6.38), we have $c_j\mathbf{d}_j^T\mathbf{Q}\,\mathbf{d}_j = 0$ for each $j$. This finding implies that all $c_j$ are 
zero because $\mathbf{Q}$ is positive definite. This is a contradiction. 
□ 

By using this principle, the vector space spanned by a set of $n$ mutually conjugate 
vectors $\mathbf{d}_k$ is $R^n$ because the $n$ vectors $\mathbf{d}_k$ are linearly independent. (For a detailed 
discussion of sets of vectors in Euclidean n-space $R^n$, refer to refs. [18] and [42].) 
The quadratic objective function can be minimized in n-dimensional space in at 
most $n$ search iterations in conjugate directions $\mathbf{d}_k$ ($k = 1, 2, \ldots, n$). The key is 
how to generate (linearly independent) conjugate direction vectors. 

6.6.2 From Orthogonality to Conjugacy* 

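We consider an iterative step in a conjugate direction $\mathbf{d}_k$ to minimize such a 
quadratic function as Equation (6.13). From Equation (6.22), we have 

$$\mathbf{g}_{k+1} - \mathbf{g}_k = \mathbf{H}(\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k) = \eta_k\,\mathbf{H}\,\mathbf{d}_k. \tag{6.39}$$

In a quadratic case, we can analytically specify the $k$th step size due to 
Equation (6.55) in Section 6.7: 

$$\eta_k = -\frac{\mathbf{g}_k^T\mathbf{d}_k}{\mathbf{d}_k^T\mathbf{H}\,\mathbf{d}_k}. \tag{6.40}$$

By using the preceding formula (6.40), we consider conjugate descent searches, 
starting with orthogonal descent searches. 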


Coordinate Descent Searches 

Coordinate directions can be used to realize orthogonal descent searches. That is, 
we consider a special case in which orthonormal coordinate bases $\mathbf{e}_k$ ($k = 1, 2, \ldots$) 
are used for search direction vectors $\mathbf{d}_k$. Note that $\mathbf{e}_k$ is a unit vector of zeros except 
for the $k$th element (e.g., $\mathbf{e}_1 = [1, 0, \ldots, 0]^T$, $\mathbf{e}_2 = [0, 1, \ldots, 0]^T$, etc.). Simply, at 
the $k$th step, the $k$th downhill direction can be regarded as either $\mathbf{e}_k$ or $-\mathbf{e}_k$. This is 
known as coordinate descent methods, whereby $E$ may be ultimately minimized 
by sequentially changing only the $k$th component: i.e., $\min_{\theta_k} E(\theta_1, \theta_2, \ldots, \theta_n)$. 
Figure 6.11 depicts this concept. Furthermore, if the largest component (in absolute 
value) of the gradient vector $\mathbf{g}_k$ is chosen as the $k$th coordinate to search, this 
is called the Gauss-Southwell method. These coordinate descent searches are 
known to be less efficient than the steepest descent methods. 

In a quadratic case, Equation (6.40) can be rewritten as 

$$\eta_k = -\frac{\mathbf{g}_k^T\mathbf{e}_k}{\mathbf{e}_k^T\mathbf{H}\,\mathbf{e}_k}, \tag{6.41}$$

and 

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta_k\,\mathbf{e}_k.$$

This equation corresponds to the well-known Gauss-Seidel iteration for solving 
a system of linear equations. 
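
The correspondence with Gauss-Seidel iteration can be seen in a short sketch. It assumes a quadratic objective $E(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}^T\mathbf{H}\boldsymbol{\theta} - \mathbf{b}^T\boldsymbol{\theta}$ with a symmetric positive definite $\mathbf{H}$, so that $\mathbf{g} = \mathbf{H}\boldsymbol{\theta} - \mathbf{b}$ and the minimum solves $\mathbf{H}\boldsymbol{\theta} = \mathbf{b}$; the matrix, the right-hand side, and the number of sweeps are hypothetical.

```python
def coordinate_descent(H, b, theta, sweeps=50):
    """Cycle through the coordinates, applying Equation (6.41) to each in turn."""
    n = len(theta)
    theta = list(theta)
    for _ in range(sweeps):
        for k in range(n):
            # k-th gradient component g^T e_k of E = 0.5 theta^T H theta - b^T theta
            g_k = sum(H[k][j] * theta[j] for j in range(n)) - b[k]
            eta_k = -g_k / H[k][k]        # e_k^T H e_k is the diagonal entry H[k][k]
            theta[k] += eta_k             # theta <- theta + eta_k * e_k
    return theta

H = [[4.0, 1.0], [1.0, 3.0]]              # assumed symmetric positive definite matrix
b = [1.0, 2.0]
theta = coordinate_descent(H, b, [0.0, 0.0])
```

Each inner update sets the $k$th equation of $\mathbf{H}\boldsymbol{\theta} = \mathbf{b}$ to equality using the latest values of the other components, which is the Gauss-Seidel update.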
Figure 6.11. Concept of orthogonal searches toward the minimum point $\hat{\boldsymbol{\theta}}$. 
The point $\boldsymbol{\theta}_k$ minimizes the objective function over the subspace spanned by 
$\{\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{k-1}\}$. 

Figure 6.12. The procedures of the Gram-Schmidt orthogonalization. As an 
example, given three vectors $\mathbf{s}_1$, $\mathbf{s}_2$, and $\mathbf{s}_3$, three mutually orthogonal vectors $\mathbf{u}_1$, $\mathbf{u}_2$, 
and $\mathbf{u}_3$ are produced. In other words, the step-by-step Gram-Schmidt process in 
Equation (6.42) converts an arbitrary basis $\{\mathbf{s}_1, \mathbf{s}_2, \mathbf{s}_3\}$ into an orthogonal basis 
$\{\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3\}$. 

Gram-Schmidt Orthogonalization 

Alternatively, given a set of direction vectors $\{\mathbf{s}_k\}$, Gram-Schmidt 
orthogonalization can be performed to make $\mathbf{u}_k$ orthogonal to all previous directions 
$\mathbf{u}_j$ ($j < k$), as follows: 

$$\mathbf{u}_k = \mathbf{s}_k - \sum_{j=0}^{k-1} \frac{\mathbf{u}_j^T\mathbf{s}_k}{\mathbf{u}_j^T\mathbf{u}_j}\,\mathbf{u}_j. \tag{6.42}$$

Figure 6.12 presents a simple example, in which the step-by-step Gram-Schmidt 
process in Equation (6.42) converts an arbitrary basis $\{\mathbf{s}_1, \mathbf{s}_2, \mathbf{s}_3\}$ into an orthogonal 
basis $\{\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3\}$. 

Such a $\mathbf{u}_k$ can also be set to the $k$th search direction $\mathbf{d}_k$. Since conjugacy is a 
generalized concept of orthogonality, we assume that the following equation inspired 
by Equation (6.42) can generate a direction $\mathbf{d}_k$ conjugate to all previous descent 
directions $\mathbf{d}_j$ ($j < k$), with respect to $\mathbf{H}$: 

$$\mathbf{d}_k = \mathbf{s}_k - \sum_{j=0}^{k-1} \frac{\mathbf{d}_j^T\mathbf{H}\,\mathbf{s}_k}{\mathbf{d}_j^T\mathbf{H}\,\mathbf{d}_j}\,\mathbf{d}_j. \tag{6.43}$$

If $\mathbf{s}_k$ is set to the unit vector $\mathbf{e}_k$ used in the Gauss-Seidel iteration, the preceding 
formula will be equivalent to a Gaussian elimination procedure with $\mathbf{d}_0 = \mathbf{e}_1$. 
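
Equation (6.43) can be sketched as a Gram-Schmidt-like loop that removes the H-weighted components of all previous directions. The helper names and the sample matrix are assumptions; starting from the coordinate vectors, the resulting directions satisfy $\mathbf{d}_j^T\mathbf{H}\,\mathbf{d}_k = 0$ for $j \ne k$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def conjugate_directions(H, S):
    """Turn the vectors in S into mutually H-conjugate directions, Equation (6.43)."""
    D = []
    for s in S:
        Hs = matvec(H, s)
        d = list(s)
        for dj in D:
            coef = dot(dj, Hs) / dot(dj, matvec(H, dj))   # d_j^T H s_k / d_j^T H d_j
            d = [di - coef * dji for di, dji in zip(d, dj)]
        D.append(d)
    return D

H = [[4.0, 1.0], [1.0, 3.0]]                 # assumed symmetric positive definite matrix
D = conjugate_directions(H, [[1.0, 0.0], [0.0, 1.0]])
conjugacy = dot(D[0], matvec(H, D[1]))       # d_0^T H d_1
```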


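Lemma 6.2 In a conjugate direction algorithm, the gradients $\mathbf{g}_k$ ($k = 1, 2, \ldots, n$) 
satisfy 

$$\mathbf{g}_k^T\mathbf{d}_j = 0 \quad \text{for } j < k, \tag{6.44}$$

where the directions denoted by $\mathbf{d}_j$ ($j = 0, 1, \ldots, k - 1$) are mutually conjugate with 
respect to $\mathbf{H}$. 

Proof by Mathematical Induction: 

Verify that this is true for $k = 1$. Multiplied by $\mathbf{d}_0^T$, Equation (6.39) with $k = 0$ 
becomes 

$$\mathbf{d}_0^T\mathbf{g}_1 - \mathbf{d}_0^T\mathbf{g}_0 = \eta_0\,\mathbf{d}_0^T\mathbf{H}\,\mathbf{d}_0.$$

With Equation (6.40), 

$$\eta_0 = -\frac{\mathbf{g}_0^T\mathbf{d}_0}{\mathbf{d}_0^T\mathbf{H}\,\mathbf{d}_0},$$

we have $\mathbf{g}_1^T\mathbf{d}_0 = 0$. 

Now assume Equation (6.44) is true for $k$. Multiplied by $\mathbf{d}_k^T$, Equation (6.39) 
becomes 

$$\mathbf{d}_k^T\mathbf{g}_{k+1} = \mathbf{d}_k^T\mathbf{g}_k + \eta_k\,\mathbf{d}_k^T\mathbf{H}\,\mathbf{d}_k. \tag{6.45}$$

Using Equation (6.40) yields the following equation: 

$$\mathbf{d}_k^T\mathbf{g}_{k+1} = \mathbf{d}_k^T\mathbf{g}_k - \frac{\mathbf{g}_k^T\mathbf{d}_k}{\mathbf{d}_k^T\mathbf{H}\,\mathbf{d}_k}\,\mathbf{d}_k^T\mathbf{H}\,\mathbf{d}_k = 0.$$

Similarly, multiplied by $\mathbf{d}_j^T$ ($j < k$), Equation (6.39) produces 

$$\mathbf{d}_j^T\mathbf{g}_{k+1} = \mathbf{d}_j^T\mathbf{g}_k + \eta_k\,\mathbf{d}_j^T\mathbf{H}\,\mathbf{d}_k = 0$$

because of the induction assumption and the conjugacy of $\mathbf{d}_j$ and $\mathbf{d}_k$. 
□ 

This lemma means that point C is chosen as the next point $\boldsymbol{\theta}_{\text{next}}$ in Figure 6.2. 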


6.6.3 Conjugate Gradient Algorithms* 

When the gradient vector is used to determine conjugate directions, the algorithm
is specifically called a conjugate gradient method. More specifically, by setting
$s_k$ to the negative gradient $-g_k$, Equation (6.43) becomes

$$
d_k = -g_k + \sum_{j=0}^{k-1} \frac{d_j^T H g_k}{d_j^T H d_j}\, d_j. \tag{6.46}
$$

Although Equation (6.43) requires the memory of all previous directions $d_j$, this
disadvantage vanishes from formula (6.46) because $s_k = -g_k$. This can be
shown in the following manner.

Let $\alpha_{kj}$ denote the $j$th term of the sum on the right-hand side of
Equation (6.46):

$$
\sum_{j=0}^{k-1} \alpha_{kj} = \sum_{j=0}^{k-1} \frac{d_j^T H g_k}{d_j^T H d_j}\, d_j.
$$

Using Equation (6.39) yields

$$
\alpha_{kj} = \frac{g_k^T (g_{j+1} - g_j)}{d_j^T (g_{j+1} - g_j)}\, d_j.
$$

Due to Lemma 6.2, we obtain

$$
\alpha_{kj} =
\begin{cases}
0 & \text{if } j < k-1, \\
\alpha_{k,k-1} & \text{otherwise,}
\end{cases}
\tag{6.47}
$$

where

$$
\alpha_{k,k-1} = \frac{g_k^T (g_k - g_{k-1})}{d_{k-1}^T (g_k - g_{k-1})}\, d_{k-1}.
$$

Hence, Equation (6.46) reduces to

$$
d_k = -g_k + \beta_k d_{k-1}, \tag{6.48}
$$

where

$$
\beta_k = \frac{g_k^T (g_k - g_{k-1})}{d_{k-1}^T (g_k - g_{k-1})}. \tag{6.49}
$$

This is the so-called Beale-Sorenson's formula [4, 47] with such a specific
$\beta_k$. The term $\beta_k$ can be defined in various ways. A family of conjugate
gradient methods generates the direction vectors according to the basic
formula (6.48).

Furthermore, by switching $k$ to $k-1$ in Equation (6.46), we have

$$
d_{k-1} = -g_{k-1} + \sum_{j=0}^{k-2} \frac{d_j^T H g_{k-1}}{d_j^T H d_j}\, d_j.
$$

By plugging this into Equation (6.49), we obtain Polak-Ribiere's formula [38]:

$$
\beta_k = \frac{g_k^T (g_k - g_{k-1})}{g_{k-1}^T g_{k-1}}. \tag{6.50}
$$


Now Equation (6.48) shows that $g_k$ is in the subspace spanned by
$\{d_0, d_1, \ldots, d_k\}$. Therefore, due to Equation (6.44), we have the
following property:

$$
g_k^T g_j = 0 \quad \text{for } j < k. \tag{6.51}
$$

This shows $g_{k+1}^T g_k = 0$ and, thus, Equation (6.50) leads to Fletcher-Reeves's
formula [17]:

$$
\beta_k = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}}. \tag{6.52}
$$

Note that the first direction $d_0$ is set to the steepest descent direction $-g_0$,
with $\beta_0 = 0$ in those formulas.

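These three formulas, attributed to Fletcher-Reeves, Polak-Ribiere, and
Beale-Sorenson, respectively, are listed as follows:

$$
\begin{aligned}
\text{Fletcher-Reeves:}\quad & \beta_k = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}} \\[4pt]
\text{Polak-Ribiere:}\quad & \beta_k = \frac{g_k^T (g_k - g_{k-1})}{g_{k-1}^T g_{k-1}} \\[4pt]
\text{Beale-Sorenson:}\quad & \beta_k = \frac{g_k^T (g_k - g_{k-1})}{d_{k-1}^T (g_k - g_{k-1})}
\end{aligned}
$$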

From Equation (6.40), we have derived those conjugate gradient formulas by
assuming that $\eta_k$ can be determined analytically. Thus, for applications to
general objective functions, the methods require line minimization. Since
Fletcher-Reeves's formula (6.52) assumes that the property $g_{k+1}^T g_k = 0$
holds, Polak-Ribiere's formula (6.50) often works better in practical
applications [39]. Obviously, for a quadratic objective function, the two
formulas (6.50) and (6.52) are identical, but not for a general function.
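The following sketch applies the direction update (6.48) with each of the three
$\beta_k$ formulas to the quadratic function $E(x, y) = x^2 + xy + y^2 - x + y$
examined in Section 6.7, using the analytic step size that is available for
quadratic functions. The function names and the starting point are assumptions
made for this illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


A = [[2.0, 1.0], [1.0, 2.0]]   # Hessian of E(x, y) = x^2 + xy + y^2 - x + y
b = [-1.0, 1.0]                # linear term, so that the gradient is A*theta + b


def grad(theta):
    return [dot(row, theta) + bi for row, bi in zip(A, b)]


def beta(name, g_new, g_old, d_old):
    """Return beta_k for the named conjugate gradient formula."""
    y = [gn - go for gn, go in zip(g_new, g_old)]
    if name == "fletcher-reeves":
        return dot(g_new, g_new) / dot(g_old, g_old)
    if name == "polak-ribiere":
        return dot(g_new, y) / dot(g_old, g_old)
    if name == "beale-sorenson":
        return dot(g_new, y) / dot(d_old, y)
    raise ValueError(name)


def conjugate_gradient(theta, name, steps=2):
    g = grad(theta)
    d = [-gi for gi in g]                     # first direction: steepest descent
    for _ in range(steps):
        Ad = [dot(row, d) for row in A]
        eta = -dot(d, g) / dot(d, Ad)         # exact line minimization (quadratic case)
        theta = [t + eta * di for t, di in zip(theta, d)]
        g_new = grad(theta)
        if dot(g_new, g_new) < 1e-24:         # gradient vanished: minimum reached
            break
        b_k = beta(name, g_new, g, d)
        d = [-gn + b_k * di for gn, di in zip(g_new, d)]
        g = g_new
    return theta


results = {name: conjugate_gradient([-3.0, 1.0], name)
           for name in ("fletcher-reeves", "polak-ribiere", "beale-sorenson")}
```

On this two-dimensional quadratic, all three variants reach the minimum
$(1, -1)$ after two line minimizations, which reflects the statement that the
formulas coincide in the quadratic case.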

To enhance the performance of steepest descent, Fletcher and Reeves [17]
introduced a systematic conjugate gradient algorithm with line minimization to
determine step sizes. They also introduced a restart algorithm. Because of errors
in computing directions and step sizes, the generated set of $n$ direction vectors
$d_k$ ($k = 1, 2, \ldots, n$) may not be mutually conjugate. Therefore, the conjugate
gradient methods are frequently implemented with some sort of restart algorithm,
wherein the direction vector is reset to the steepest descent direction
($d \leftarrow -g$) after $n$ or $n + 1$ iterations. Yet it is known that the
frequency of restarts depends on the objective function $E$. Based on Beale's
technique [4], Powell [39] proposed a more effective restart algorithm; however,
it requires more storage space for vectors.

Figure 6.13. (a) A quadratic surface $z = E(\theta) = E(x, y) = x^2 + xy + y^2 - x + y$;
(b) its gradient vector field and two downhill gradient paths. (MATLAB file: gdssl.m)


6.7 ANALYSIS OF QUADRATIC CASE 


Analyzing quadratic objective functions is of particular importance, because a gen- 
eral function can be well approximated by a quadratic function in the neighborhood 
of a (local) minimum, due to a consequence of Taylor’s theorem. 

Allow the objective function $E$ to have a quadratic form:

$$
E(\theta) = \frac{1}{2}\theta^T A \theta + b^T \theta + c. \tag{6.53}
$$

The gradient of $E(\theta)$ can be expressed as

$$
g(\theta) = A\theta + b. \tag{6.54}
$$

The step size plays a critical role in a class of descent methods. For a quadratic
objective function, the line minimization problem in Equation (6.4) can be solved
analytically. To minimize $\phi(\eta) = E(\theta_{\text{now}} + \eta d)$ in a general
descent direction $d$, we set the derivative of $\phi$ to zero:

$$
\phi'(\eta) = \frac{\partial E(\theta_{\text{now}} + \eta d)}{\partial \eta}
= [A(\theta_{\text{now}} + \eta d) + b]^T d = 0.
$$

Due to Equation (6.54), this leads to

$$
\eta^* = -\frac{d^T g}{d^T A d}. \tag{6.55}
$$

Figure 6.14. Progress of four descent algorithms with line searches for minimizing
a quadratic function $E(x, y) = x^2 + xy + y^2 - x + y$: (a) the steepest descent
method (MATLAB file: hem.m); (b) Newton's method; (c) BFGS quasi-Newton
method (MATLAB file: bfgs.m); (d) Fletcher-Reeves's conjugate gradient method
(MATLAB file: cg.m).


In the quadratic case, the general gradient-based formula in Equation (6.9) takes
the following form:

$$
\theta_{\text{next}} = \theta_{\text{now}} - \frac{g^T G g}{g^T G A G g}\, G g. \tag{6.56}
$$

If we successfully make $G$ close to $A^{-1}$, the rapid convergence property of
Newton's method can be obtained [see Figure 6.14(b)].

In the following, we consider a minimization problem in which the objective
function, as shown in Figure 6.13(a), is a two-dimensional quadratic function
defined by

$$
E(\theta) = E(x, y) = x^2 + xy + y^2 - x + y, \tag{6.57}
$$

where $\theta = [x, y]^T$. The gradient of this function is

$$
\nabla E(x, y) = [2x + y - 1,\; x + 2y + 1]^T. \tag{6.58}
$$

By setting the gradient to zero, the unique minimum point $(x^*, y^*) = (1, -1)$ can
be obtained in this example. Figure 6.13(b) plots the contour curves (solid ellipses),
the negative gradient vector field (arrows), and two downhill gradient paths starting
from two initial conditions $[-3, 1]$ and $[1.5, 2]$.
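A short sketch of Equation (6.56) for this example may help. It takes one
line-minimized step from the initial condition $[-3, 1]$, first with $G = I$
(steepest descent) and then with $G = A^{-1}$ (Newton's method). The helper names
are assumptions, and $A^{-1}$ is written out by hand for this 2-by-2 case.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    return [dot(row, v) for row in M]


A = [[2.0, 1.0], [1.0, 2.0]]                      # Hessian of E in Equation (6.57)
b = [-1.0, 1.0]                                   # so that g = A*theta + b matches (6.58)
A_INV = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]


def gradient(theta):
    return [gi + bi for gi, bi in zip(matvec(A, theta), b)]


def line_minimized_step(theta, G):
    """One step of Equation (6.56) for a symmetric matrix G."""
    g = gradient(theta)
    Gg = matvec(G, g)
    eta = dot(g, Gg) / dot(Gg, matvec(A, Gg))     # g^T G g / g^T G A G g
    return [t - eta * v for t, v in zip(theta, Gg)]


sd_point = line_minimized_step([-3.0, 1.0], IDENTITY)   # steepest descent step
newton_point = line_minimized_step([-3.0, 1.0], A_INV)  # Newton step
```

The steepest descent step lands on $[0, 1]$, whereas the Newton step lands
directly on the minimum $(1, -1)$.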


6.7.1 Descent Methods with Line Minimization 

We present the feasibility of applying the following four representative descent
methods: the steepest descent method [Equation (6.11)], Newton's method
[Equation (6.15)], the BFGS quasi-Newton method [Equation (6.25)], and
Fletcher-Reeves's conjugate gradient method [Equation (6.52)]. (These four methods
are described in Sections 6.3, 6.4.1, 6.4.3, and 6.6, respectively.) Figure 6.14
illustrates the progress of the four descent methods, in which the step sizes were
determined by Equation (6.55).

In the steepest descent method whereby $d = -g$, Equation (6.55) reduces to

$$
\eta^* = \frac{g^T g}{g^T A g}. \tag{6.59}
$$


As discussed earlier, if the objective function has elliptical contours, the steepest
descent with line minimization is likely to produce a zigzag trajectory, known as
hemstitching; it is illustrated in Figure 6.14(a). Successive search directions are
orthogonal to each other due to Equation (6.12), and there are only two search
directions. If the contours are circular, the search would be a one-step process
instead of the inefficient zigzag one shown here (see Figure 6.4).

On the other hand, the minimum was reached in a single stride by Newton’s 
method [Figure 6.14(b)]. Quasi-Newton and conjugate gradient methods produced 
the same two-step trajectory to reach the minimum, as shown in Figures 6.14(c) 
and 6.14(d). 
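The hemstitching behavior can be reproduced with a few lines of Python. The
sketch below runs steepest descent with the analytic step size of Equation (6.59)
on the quadratic function (6.57); the starting point, the tolerance, and the
function names are assumptions for illustration.

```python
import math


def grad(theta):
    """Gradient of E(x, y) = x^2 + xy + y^2 - x + y, Equation (6.58)."""
    x, y = theta
    return [2 * x + y - 1, x + 2 * y + 1]


def steepest_descent(theta, tol=1e-8, max_steps=1000):
    """Steepest descent with the exact step size of Equation (6.59)."""
    A = [[2.0, 1.0], [1.0, 2.0]]
    path = [list(theta)]
    grads = []
    g = grad(theta)
    while math.hypot(*g) >= tol and len(path) <= max_steps:
        Ag = [A[0][0] * g[0] + A[0][1] * g[1], A[1][0] * g[0] + A[1][1] * g[1]]
        eta = (g[0] * g[0] + g[1] * g[1]) / (g[0] * Ag[0] + g[1] * Ag[1])
        theta = [theta[0] - eta * g[0], theta[1] - eta * g[1]]
        grads.append(g)
        path.append(theta)
        g = grad(theta)
    return path, grads


path, grads = steepest_descent([-3.0, 1.0])
steps = len(path) - 1
```

Each pair of consecutive gradients in `grads` is orthogonal, so the path
alternates between two directions and needs many steps, in contrast to the
single Newton step and the two conjugate gradient steps.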

Here we determined the step sizes analytically by using Equation (6.55). In
general, determining the optimal step size in Equation (6.5) for a general function
$E(\theta)$ is not a trivial task, as discussed in Section 6.5. In the next subsection,
based on the steepest descent method, we discuss schemes that use a fixed or
heuristically updated step size; they are computationally cheaper and are
applicable to any complicated objective function.


6.7.2 Steepest Descent Method without Line Minimization 

Recalling Equation (6.11), we first use a small fixed step size $\eta$:

$$
\theta_{\text{next}} = \theta_{\text{now}} - \eta g. \tag{6.60}
$$

A slightly different version of Equation (6.60) can be obtained by normalizing the
gradient:

$$
\theta_{\text{next}} = \theta_{\text{now}} - \kappa \frac{g}{\|g\|}, \tag{6.61}
$$

where $\kappa$ is the real step size, indicating the Euclidean distance of the
transition from $\theta_{\text{now}}$ to $\theta_{\text{next}}$:

$$
\kappa = \|\theta_{\text{next}} - \theta_{\text{now}}\|.
$$

To differentiate Equations (6.60) and (6.61), we refer to Equation (6.60) as simple
steepest descent, and Equation (6.61) as a normalized version of simple steepest
descent in this section.

Figure 6.15. The effects of step sizes $\eta$ by the steepest descent method defined in
Equation (6.60), with panels for $\eta$ = 0.1, 0.2, 0.3, 0.5, 0.6, and 0.7. The search
is inefficient when $\eta < 0.1$ and unstable when $\eta > 0.7$.
(MATLAB file: gdssl.m)

The magnitude of the step $\eta g$ in Equation (6.60) with a fixed $\eta$ automatically
changes at each iteration because the gradient $g$ changes. If the minimum point lies
in a plateau, then $g$ tends to be infinitesimally small and the simple
steepest descent in Equation (6.60) converges slowly. On the other hand, the
normalized version of simple steepest descent in Equation (6.61) with a fixed $\kappa$
always makes the same strides, regardless of how steep the slope is. For comparison,
we apply Equations (6.60) and (6.61) to the same quadratic problem in the next
subsections.

Using a Small Fixed $\eta$

Figure 6.15 summarizes the results of applying Equation (6.60) with six different
values of $\eta$. The search is inefficient when $\eta$ is less than 0.2. When the step
size $\eta$ is 0.6, the search path exhibits oscillatory behavior. If the step size
exceeds 0.6, the search path diverges and the method fails.

Figure 6.16 shows comparisons in Euclidean distance from the minimum point
at each iteration of the four simple steepest descent methods when $\eta$ = 0.1, 0.3,
0.5, and 0.6. As indicated in Figures 6.15 and 6.16, an ideal value of $\eta$ may be
close to 0.5.

Figure 6.16. Comparisons in Euclidean distance from the minimum point at each
iteration of the simple steepest descent methods. (Horizontal axis: iteration.)
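These observations can be checked with a small experiment. The Hessian of the
quadratic function (6.57) has eigenvalues 1 and 3, so each iteration of
Equation (6.60) multiplies the error components by $1 - \eta$ and $1 - 3\eta$;
the iteration therefore converges only when $\eta < 2/3$, and the two factors are
balanced at $\eta = 0.5$. The sketch below measures the distance from the minimum
after a fixed number of iterations; the starting point $[-3, 1]$ and the iteration
count are assumptions.

```python
import math


def grad(x, y):
    """Gradient of E(x, y) = x^2 + xy + y^2 - x + y, Equation (6.58)."""
    return 2 * x + y - 1, x + 2 * y + 1


def distance_after(eta, steps=50, start=(-3.0, 1.0)):
    """Distance to the minimum (1, -1) after `steps` iterations of Equation (6.60)."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return math.hypot(x - 1.0, y + 1.0)


distances = {eta: distance_after(eta) for eta in (0.1, 0.3, 0.5, 0.6, 0.7)}
```

Among the tested values, $\eta = 0.5$ gives the smallest remaining distance,
$\eta = 0.6$ still converges while oscillating, and $\eta = 0.7$ ends farther from
the minimum than it started.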


How to Update $\eta$

Instead of using a fixed small step size, Chan and Fallside [9] introduced a
heuristic method capable of updating $\eta$ in Equation (6.11) during the steepest
descent process. They used the well-known backpropagation learning rule with a
momentum term, as contrived by Rumelhart et al. [43], to enhance the simple
steepest descent method. The formula with an embedded momentum term $\omega$
generates the direction vectors

$$
d_k = -g_k + \omega d_{k-1}. \tag{6.62}
$$

(The momentum term $\omega$ regulates the influence of the previous descent direction.
See also Section 9.4.2.) Chan and Fallside proposed a strategy to adjust the step
size $\eta$:

$$
\eta_k = \eta_{k-1}\left(1 + \frac{1}{2}\cos\alpha_k\right), \tag{6.63}
$$

where

$$
\cos\alpha_k = \frac{-g_k^T d_{k-1}}{\|g_k\|\,\|d_{k-1}\|}.
$$

This formula aims to increase the step size when the direction looks good (according
to the angle $\alpha$). The formula correlates with the situation in Figure 6.15 when
$\eta$ = 0.1 or 0.2. Many steps were taken in the same direction because $\|g\|$
decreased when closer to the minimum in this example.

Figure 6.17. The effects of step sizes $\kappa$ in the normalized version of the simple
steepest descent method: (a) inefficient search when $\kappa$ is 0.25; (b) oscillatory
behavior when $\kappa$ is 1. (MATLAB file: gdss2.m)

Equation (6.62) is similar to a class of conjugate gradient methods defined in
Equation (6.48). Chan and Shatin [10] presented an advantage of this method over
Fletcher-Reeves's conjugate gradient method (Section 6.6).

Using a Small Fixed $\kappa$

When a normalized version of the simple steepest descent method [Equation (6.61)]
is used, the actual step size $\kappa$ also affects the search results. To demonstrate
the effects, we tested two small fixed values of $\kappa$ to minimize the quadratic
function in Figure 6.13. Figure 6.17 summarizes those results; Figure 6.17(a) is the
search path when $\kappa$ is 0.25, and Figure 6.17(b) is the search path when $\kappa$
is 1. A small value of $\kappa$ obviously leads to an inefficient search unless it is
already in the vicinity of the minimum. On the other hand, a large value of $\kappa$
allows the search process to approach the minimum efficiently, but it then oscillates
around the local minimum and cannot pinpoint it precisely.

How to Update $\kappa$

The preceding example calls for an adaptive strategy to adjust the step size $\kappa$
dynamically. Based on empirical observations, the step size $\kappa$ can be updated
according to the following two heuristic rules [29]:

1. If the objective function undergoes $m$ consecutive reductions, increase $\kappa$ by
$p$%.

2. If the objective function undergoes $n$ consecutive combinations of one increase
and one reduction, decrease $\kappa$ by $q$%.

Figure 6.18 illustrates these rules with typical values for $m$, $n$, $p$, and $q$ of
4, 2, 10, and 10, respectively. These typical values are more or less chosen
arbitrarily. This updating strategy is incorporated in the hybrid learning described
in Chapters 12 and 19.

Figure 6.18. Two heuristic rules for updating step size $\kappa$ (objective function
values plotted against iterations). Rule 1: Increase the step size after 4 downs
(point A). Rule 2: Decrease the step size after 2 combinations of 1 up and 1 down
(point B).
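One possible reading of the two rules is sketched below. The class name
`KappaAdapter`, the choice to count an unchanged objective value as an increase,
and the resetting of the history after every update are assumptions of this
sketch and are not specified in the text.

```python
class KappaAdapter:
    """Adjust the step size kappa from a stream of objective function values."""

    def __init__(self, kappa, m=4, n=2, p=10.0, q=10.0):
        self.kappa = kappa
        self.m, self.n, self.p, self.q = m, n, p, q
        self.changes = []      # -1 for a reduction, +1 for an increase
        self.previous = None

    def observe(self, value):
        """Record a new objective value and return the (possibly updated) kappa."""
        if self.previous is not None:
            self.changes.append(-1 if value < self.previous else 1)
        self.previous = value
        if self.changes[-self.m:] == [-1] * self.m:
            # Rule 1: m consecutive reductions, so enlarge the step.
            self.kappa *= 1.0 + self.p / 100.0
            self.changes.clear()
        elif self.changes[-2 * self.n:] == [1, -1] * self.n:
            # Rule 2: n consecutive up-down pairs, so shrink the step.
            self.kappa *= 1.0 - self.q / 100.0
            self.changes.clear()
        return self.kappa


adapter = KappaAdapter(kappa=1.0)
for value in [5.0, 4.0, 3.0, 2.0, 1.0]:    # four consecutive reductions
    adapter.observe(value)
kappa_after_rule_1 = adapter.kappa
for value in [2.0, 1.0, 2.0, 1.0]:         # up, down, up, down
    adapter.observe(value)
kappa_after_rule_2 = adapter.kappa
```

In a descent loop, `observe` would be called once per iteration with the current
objective value, and the returned $\kappa$ would be used in Equation (6.61).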


6.8 NONLINEAR LEAST-SQUARES PROBLEMS 

In the following discussion of least-squares problems, we wish to optimize a model
by minimizing a squared error measure between desired outputs and the model's
outputs. We assume a $t$-input single-output nonlinear model with $n$ modifiable
parameters:

$$
y = f(x, \theta), \tag{6.64}
$$

where $x$ is the input vector of size $t$, $y$ is the model's scalar output, and
$\theta$ is the parameter vector of size $n$.

Given a set of $m$ training data pairs $(x_p; t_p)$, $p = 1, \ldots, m$, the most common
objective in data fitting and nonlinear regression problems is to find the optimal
$\theta$ that minimizes the sum of squared errors:

$$
\begin{aligned}
E(\theta) &= \sum_{p=1}^{m} (t_p - y_p)^2 \\
&= \sum_{p=1}^{m} \bigl(t_p - f(x_p, \theta)\bigr)^2 \\
&= \sum_{p=1}^{m} r_p(\theta)^2 \\
&= r(\theta)^T r(\theta),
\end{aligned}
\tag{6.65}
$$

where $t_p$ is the desired output when the input is $x_p$; $y_p = f(x_p, \theta)$ is the
model's output when the input is $x_p$; $r_p(\theta)$ is the difference between $t_p$
and $y_p$; and $r(\theta)$ is a vector composed of $r_i(\theta)$, $i = 1, \ldots, m$. We
shall derive the gradient and Hessian of the preceding objective function. Such
information will be used in the development of the Gauss-Newton and
Levenberg-Marquardt methods. In the remainder, we focus on such methods for solving
the nonlinear least-squares problem [33, 28]; that is, $\theta$ is chosen to be the
least-squares estimator.

The gradient vector of $E$ at $\theta$ is the vector of the first derivatives of $E$:

$$
g = g(\theta) = \frac{\partial E(\theta)}{\partial \theta}
= 2 \sum_{p=1}^{m} r_p(\theta) \frac{\partial r_p(\theta)}{\partial \theta}
= 2 J^T r, \tag{6.66}
$$

where $J$ is the Jacobian matrix of $r$. [See Equation (5.7) on page 103.] Since
$r_p(\theta) = t_p - f(x_p, \theta)$, the $p$th row of $J$ is equal to
$-\nabla_\theta^T f(x_p, \theta)$.

Taking additional partial derivatives yields the Hessian matrix $H$ at $\theta$:

$$
H = H(\theta) = \frac{\partial^2 E(\theta)}{\partial \theta\, \partial \theta^T}
= 2 \sum_{p=1}^{m} \left[ \frac{\partial r_p(\theta)}{\partial \theta}
\frac{\partial r_p(\theta)}{\partial \theta^T}
+ r_p(\theta) \frac{\partial^2 r_p(\theta)}{\partial \theta\, \partial \theta^T} \right]
= 2 (J^T J + S), \tag{6.67}
$$

where $S$ is defined by

$$
S = S(\theta) = \sum_{p=1}^{m} r_p(\theta)
\frac{\partial^2 r_p(\theta)}{\partial \theta\, \partial \theta^T}.
$$
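The gradient formula $g = 2J^T r$ in Equation (6.66) can be checked numerically.
The sketch below uses a hypothetical two-parameter model
$f(x, \theta) = p\,e^{-qx}$ with synthetic data (the model, the data, and all
function names are assumptions for illustration) and compares the analytic
gradient with a central finite difference of the sum of squared errors.

```python
import math


def model(x, theta):
    p, q = theta
    return p * math.exp(-q * x)          # hypothetical model, not from the text


def residuals(theta, xs, ts):
    return [t - model(x, theta) for x, t in zip(xs, ts)]


def jacobian(theta, xs):
    # Each row holds the derivatives of r = t - f, i.e. minus the gradient of f.
    p, q = theta
    return [[-math.exp(-q * x), p * x * math.exp(-q * x)] for x in xs]


def sse(theta, xs, ts):
    return sum(r * r for r in residuals(theta, xs, ts))


def gradient(theta, xs, ts):
    """g = 2 J^T r, Equation (6.66)."""
    r = residuals(theta, xs, ts)
    J = jacobian(theta, xs)
    return [2.0 * sum(row[i] * ri for row, ri in zip(J, r)) for i in range(2)]


xs = [0.1 * k for k in range(1, 11)]
ts = [model(x, (2.0, 1.5)) for x in xs]   # synthetic targets
theta = [1.0, 1.0]
g = gradient(theta, xs, ts)

h = 1e-6
g_fd = []
for i in range(2):
    up = list(theta)
    up[i] += h
    down = list(theta)
    down[i] -= h
    g_fd.append((sse(up, xs, ts) - sse(down, xs, ts)) / (2 * h))
```

The two gradients agree to within the accuracy of the finite difference. The
Gauss-Newton part of the Hessian, $2J^TJ$, can be assembled from the same
`jacobian` rows without any second derivatives.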


6.8.1 Gauss-Newton Method 

The Gauss-Newton method, also known as the linearization method, first uses
a Taylor series expansion to obtain a linear model that approximates the original
nonlinear model. Then the ordinary least-squares methods discussed in Chapter 5
are employed to estimate the model's parameters. More specifically, if we allow the
current parameters to be denoted by $\theta_{\text{now}}$, the nonlinear model $f$ in
Equation (6.64) can be expanded in a Taylor series around
$\theta = \theta_{\text{now}}$ and only the linear terms are retained:

$$
y = f(x, \theta_{\text{now}}) + \sum_{i=1}^{n}
\left( \left. \frac{\partial f(x, \theta)}{\partial \theta_i}
\right|_{\theta = \theta_{\text{now}}} \right)
(\theta_i - \theta_{i,\text{now}}). \tag{6.68}
$$

Inspection of the preceding equation reveals that the translated output
$y - f(x, \theta_{\text{now}})$ is a linear function of the translated parameters
$\theta_i - \theta_{i,\text{now}}$.

Plugging this approximated linear model into $E$ in Equation (6.65) yields

$$
\begin{aligned}
E(\theta) &= \left\| t - f(x, \theta_{\text{now}})
- \frac{\partial f(x, \theta_{\text{now}})}{\partial \theta^T}
(\theta - \theta_{\text{now}}) \right\|^2 \\
&= \| r + J(\theta - \theta_{\text{now}}) \|^2 \\
&= \| r + J\delta \|^2,
\end{aligned}
\tag{6.69}
$$

where $\delta = \theta - \theta_{\text{now}}$. We thus obtain $\theta_{\text{next}}$ to
minimize the preceding equation by solving Equation (6.10). That is,

$$
\left. \frac{\partial E(\theta)}{\partial \theta}
\right|_{\theta = \theta_{\text{next}}}
= 2 J^T \{ r + J(\theta_{\text{next}} - \theta_{\text{now}}) \} = 0.
$$

Therefore, we have the following Gauss-Newton formula, expressed as

$$
\begin{aligned}
\theta_{\text{next}} &= \theta_{\text{now}} - (J^T J)^{-1} J^T r \\
&= \theta_{\text{now}} - \tfrac{1}{2} (J^T J)^{-1} g,
\end{aligned}
\tag{6.70}
$$

where $g$ is the gradient defined in Equation (6.66). The preceding equation conforms
to the general form of gradient-based descent in Equation (6.9), with
$G = (J^T J)^{-1}$; the matrix $J^T J$ is positive definite unless $J$ is not of full
rank.
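A minimal sketch of the Gauss-Newton iteration (6.70) is given below for a
hypothetical two-parameter model $f(x, \theta) = p\,e^{-qx}$ fitted to noise-free
synthetic data. The model, the data, the starting point, and the function names
are assumptions; the 2-by-2 normal equations are solved in closed form only to
keep the example free of external libraries.

```python
import math


def model(x, p, q):
    return p * math.exp(-q * x)               # hypothetical model, not from the text


def gauss_newton(p, q, xs, ts, iterations=10):
    """Iterate theta_next = theta_now - (J^T J)^{-1} J^T r for theta = (p, q)."""
    for _ in range(iterations):
        r = [t - model(x, p, q) for x, t in zip(xs, ts)]
        # Rows of J are the derivatives of r with respect to (p, q).
        J = [[-math.exp(-q * x), p * x * math.exp(-q * x)] for x in xs]
        a = sum(row[0] * row[0] for row in J)         # entries of J^T J
        b = sum(row[0] * row[1] for row in J)
        d = sum(row[1] * row[1] for row in J)
        c0 = sum(row[0] * ri for row, ri in zip(J, r))  # entries of J^T r
        c1 = sum(row[1] * ri for row, ri in zip(J, r))
        det = a * d - b * b
        dp = -(d * c0 - b * c1) / det
        dq = -(a * c1 - b * c0) / det
        p, q = p + dp, q + dq
    return p, q


xs = [0.1 * k for k in range(1, 11)]
ts = [model(x, 2.0, 1.5) for x in xs]         # synthetic targets with p = 2, q = 1.5
p_fit, q_fit = gauss_newton(1.8, 1.4, xs, ts)
```

Starting near the generating parameters, the iteration recovers them. Pure
Gauss-Newton steps carry no safeguard, so a poor starting point can make the
iteration fail, which motivates the modifications discussed later in this
section.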

Other Derivations

The name of the Gauss-Newton method reflects the fact, introduced by Gauss in 1809,
that Equation (6.70) can be obtained by modifying the Newton method. In light
of Equations (6.66) and (6.67), Equation (6.15) (i.e., Newton's method) can be
rewritten as

$$
\begin{aligned}
\theta_{\text{next}} &= \theta_{\text{now}} - H^{-1} g \\
&= \theta_{\text{now}} - (J^T J + S)^{-1} J^T r.
\end{aligned}
\tag{6.71}
$$

The Gauss-Newton method is based on the assumption that $S$ is small compared with
$J^T J$. This assumption amounts to neglecting $S$, which involves the second
derivatives. We then obtain Equation (6.70). The advantage of the Gauss-Newton
algorithms over the Newton methods is that they require only the first derivatives.

Notably, with the same notations used in Chapter 5, Equation (6.70) can be
represented as

$$
\begin{aligned}
\theta_{\text{next}} &= \theta_{\text{now}} + (A^T A)^{-1} A^T \Delta y \\
&= \theta_{\text{now}} + \Delta\theta,
\end{aligned}
\tag{6.72}
$$

where $\Delta y_i = t_i - f(x_i, \theta_{\text{now}})$,
$\Delta\theta = (A^T A)^{-1} A^T \Delta y$, and $J$ corresponds to $-A$.
This can be shown in the following manner. After plugging all training
data into Equation (6.68), we have the following matrix equation:

$$
A\, \Delta\theta = r,
$$

where the $i$th element of $r$ is $r_i = t_i - f(x_i, \theta_{\text{now}})$,
$\Delta\theta = \theta_{\text{next}} - \theta_{\text{now}}$, and the element at row $i$
and column $j$ of matrix $A$ is
$\left. \partial f(x_i, \theta) / \partial \theta_j \right|_{\theta = \theta_{\text{now}}}$.
By using the standard least-squares method described in Chapter 5, we have

$$
\Delta\theta = (A^T A)^{-1} A^T r.
$$


Implementations

Calculating $(J^T J)^{-1}$ in the Gauss-Newton method in Equation (6.70) may not be
numerically stable for certain $J$. To overcome this limitation, $\delta$ can be
obtained by solving the linear least-squares problem

$$
\underset{\delta}{\text{minimize}}\; \| r + J\delta \|^2, \tag{6.73}
$$

based on decompositions of $J$, such as QR decompositions [24] and singular-value
decompositions.

Figure 6.19. The left figure illustrates the surface of the objective function defined
in Equation (6.75). The right figure shows a Levenberg-Marquardt (LM) direction,
the Gauss-Newton (GN) direction, and the steepest descent (SD) direction. The
circle signifies the starting point, and the * denotes the optimal minimum.

Hartley’s Method 

The Gauss-Newton method can be modified by introducing a step size, resulting
in a form similar to Equation (6.17). The step length can be determined in
conjunction with line searches to satisfy Equation (6.7). This modified method is
called Hartley's method [25, 26] or the damped Gauss-Newton method.

It is known that Hartley's modification does not make the Gauss-Newton method
as robust as the Levenberg-Marquardt modification, which follows.

6.8.2 Levenberg-Marquardt Concepts 

The Levenberg-Marquardt concept discussed in Section 6.4.2 can be applied to the
Gauss-Newton method. This method can handle ill-conditioned matrices $J^T J$ well
by altering Equation (6.70) to

$$
\theta_{\text{next}} = \theta_{\text{now}} - (J^T J + \lambda I)^{-1} g_h, \tag{6.74}
$$

where $\lambda$ is some nonnegative value and $g_h = \frac{1}{2} g$ for simplicity.

For instance, we consider a trivial nonlinear model $f(x_k, \theta)$ with two
adjustable parameters $p$ and $q$; i.e., $\theta = (p, q)^T$. The objective is to find
the optimal $\theta^* = (1, 10)^T$ that minimizes the following sum of squared errors:

$$
E(\theta) = \sum_{k=1}^{10} \bigl(t_k - f(x_k, \theta)\bigr)^2, \tag{6.75}
$$

where $t_k$ is the desired output when the input is $0.1k$ [6]. Figure 6.19 illustrates
the surface of the objective function $E$ and the Levenberg-Marquardt direction that
is determined by using Equation (6.74) when $\lambda = 0.07$. Clearly, the
Levenberg-Marquardt direction (highlighted arrow) is intermediate between the
Gauss-Newton direction ($\lambda \to 0$) and the steepest descent direction
($\lambda \to \infty$).
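The interpolation between the two directions can be observed numerically. The
sketch below solves $(J^TJ + \lambda I)\delta = -g_h$ for increasing values of
$\lambda$ and records the length of $\delta$ and its angle to the negative
gradient. The 2-by-2 matrix standing in for $J^TJ$ and the vector standing in for
$g_h$ are made-up numbers, not values from the example in Equation (6.75).

```python
import math


def lm_step(JTJ, gh, lam):
    """Solve (J^T J + lam * I) delta = -g_h for a 2-by-2 system (Cramer's rule)."""
    a = JTJ[0][0] + lam
    b = JTJ[0][1]
    c = JTJ[1][0]
    d = JTJ[1][1] + lam
    det = a * d - b * c
    return [(-d * gh[0] + b * gh[1]) / det, (c * gh[0] - a * gh[1]) / det]


def angle(u, v):
    """Angle in radians between two nonzero 2-vectors."""
    cosine = (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))
    return math.acos(max(-1.0, min(1.0, cosine)))


JTJ = [[4.0, 2.0], [2.0, 3.0]]       # hypothetical positive definite J^T J
gh = [1.0, -2.0]                     # hypothetical half gradient g_h
neg_g = [-gh[0], -gh[1]]

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
steps = [lm_step(JTJ, gh, lam) for lam in lambdas]
norms = [math.hypot(*s) for s in steps]
angles = [angle(s, neg_g) for s in steps]
```

As $\lambda$ grows, both `norms` and `angles` shrink: the step becomes shorter and
turns toward the steepest descent direction, while small $\lambda$ leaves it close
to the Gauss-Newton step.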


Theoretical Aspects

To clarify the theoretical aspects behind Equation (6.74), we quote a series of three
important theorems, which are due to Morrison [35] and Marquardt [33].

Theorem 6.1 Let $\lambda > 0$ be arbitrary and let $\delta_0$ satisfy the equation

$$
(J^T J + \lambda I)\delta_0 = -g_h, \tag{6.76}
$$

which corresponds to Equation (6.74). Then $\delta_0$ minimizes the approximated
$E(\theta)$ defined in Equation (6.69), on the sphere whose radius $\|\delta\|$ satisfies

$$
\|\delta\|^2 = \|\delta_0\|^2.
$$

(For proof, see refs. [33] and [35].)


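Theorem 6.2 Let $\delta(\lambda)$ be the solution of Equation (6.76) for a given value
of $\lambda$. Then $\|\delta(\lambda)\|^2$ is a continuously decreasing function of
$\lambda$ such that as $\lambda \to \infty$, $\|\delta(\lambda)\|^2 \to 0$.

Proof: A rough proof can be given as follows. Taking derivatives of
Equation (6.76) with respect to $\lambda$ yields

$$
\delta(\lambda) + (J^T J + \lambda I)\frac{d\delta(\lambda)}{d\lambda} = 0.
$$

This equation can be rewritten as

$$
\frac{d\delta(\lambda)}{d\lambda} = -(J^T J + \lambda I)^{-1}\delta(\lambda).
$$

Thus, we obtain

$$
\begin{aligned}
\frac{d\|\delta(\lambda)\|^2}{d\lambda}
&= 2\,\delta^T(\lambda)\frac{d\delta(\lambda)}{d\lambda} \\
&= -2\,\delta^T(\lambda)(J^T J + \lambda I)^{-1}\delta(\lambda) \\
&< 0.
\end{aligned}
$$

This inequality shows that $\|\delta(\lambda)\|^2$ is a continuously decreasing
function of $\lambda$. (For detailed proof, see refs. [33] and [35].)

□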

Theorem 6.3 Let $\gamma$ be the angle between $\delta_0$ and the negative gradient of
$E$ ($-g$). Then $\gamma$ is a continuous monotone decreasing function of $\lambda$
such that as $\lambda \to \infty$, $\gamma \to 0$. Since $g$ is independent of
$\lambda$, it follows that $\delta_0$ rotates toward the negative gradient ($-g$) as
$\lambda \to \infty$. (For proof, see ref. [33].)

The situation of Theorem 6.3 is illustrated in Figure 6.5.

In proving Theorem 6.2, Marquardt [33] used the concept of scaling [as in
Equation (6.16)]. Now we show that some scaling rewrites Equation (6.76) as

$$
(J^T J + \lambda D)\delta_0 = -g_h. \tag{6.77}
$$

The diagonal matrix $D$ can be determined by

$$
D_{ii} = (J^T J)_{ii} + \rho,
$$

where $\rho$ is some nonnegative value [31, 36, 37]. For instance,

$$
D = \mathrm{diag}(J^T J) + I.
$$

Consider the following transformation:

$$
\theta' = \sqrt{D}\,\theta, \qquad J = J'\sqrt{D}, \qquad g_h = \sqrt{D}\,g_h'. \tag{6.78}
$$

Equation (6.77) can be rewritten in the transformed space $\theta'$:

$$
(J'^T J' + \lambda I)\delta_0' = -g_h', \tag{6.79}
$$

which is the same form as Equation (6.76). This implies that solving Equation (6.77)
automatically accomplishes some scaling. When the columns of $J^T J$ have
significantly different norms, Equation (6.77) may work better than Equation (6.76).

Implementations

The Gauss-Newton-based Levenberg-Marquardt method works well in practice and
has become the standard of nonlinear least-squares routines. There are many
variations in the Levenberg-Marquardt procedures. The following algorithm is one of
them.

Algorithm 6.3 Main loop of the Gauss-Newton based Levenberg-Marquardt algorithm

(1) $d \leftarrow 0.001$, and $\mathit{factor} \leftarrow 10.0$.

(2) Evaluate $E(\theta_{\text{now}})$.

(3) $h_{\max} \leftarrow \max\{\mathrm{diag}([J^T J]_{kk})\}$, $k = 1, 2, \ldots$

(4) $\lambda \leftarrow d\, h_{\max}$.

(5) Solve Equation (6.76) or (6.77) for $\delta$ ($= \theta_{\text{next}} - \theta_{\text{now}}$).

(6) Evaluate $E(\theta_{\text{next}})$ [i.e., $E(\theta_{\text{now}} + \delta)$].

(7) If $E(\theta_{\text{next}}) < E(\theta_{\text{now}})$,
    $d \leftarrow d / \mathit{factor}$,
    $\theta_{\text{now}} \leftarrow \theta_{\text{next}}$,
    go to (4).
    Otherwise, $d \leftarrow d \cdot \mathit{factor}$,
    go to (4).

□

Initial parameter values, which are problem dependent, must be found by a process
of trial and error; the initial $\lambda$ may be 0.01, and $\mathit{factor}$ can be set
to a smaller value, such as 2.0. In place of procedures (3) and (4), we can simply use
an alternative procedure: $\lambda \leftarrow d$. Also, to adapt $\lambda$, a
trust-region approach [14, 34] or a line search [36] can be incorporated.
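The main loop of Algorithm 6.3 can be sketched as follows for a hypothetical
two-parameter model $f(x, \theta) = p\,e^{-qx}$ with synthetic data. The model,
the data, the fixed iteration count, and the function names are assumptions. The
sketch also re-evaluates $J$ and $h_{\max}$ in every pass, which is one possible
reading of the loop rather than a literal transcription.

```python
import math


def model(x, p, q):
    return p * math.exp(-q * x)                 # hypothetical model, not from the text


def sse(p, q, xs, ts):
    return sum((t - model(x, p, q)) ** 2 for x, t in zip(xs, ts))


def levenberg_marquardt(p, q, xs, ts, d=0.001, factor=10.0, iterations=100):
    e_now = sse(p, q, xs, ts)                               # step (2)
    for _ in range(iterations):
        r = [t - model(x, p, q) for x, t in zip(xs, ts)]
        J = [[-math.exp(-q * x), p * x * math.exp(-q * x)] for x in xs]
        a = sum(row[0] * row[0] for row in J)               # entries of J^T J
        b = sum(row[0] * row[1] for row in J)
        c = sum(row[1] * row[1] for row in J)
        gh0 = sum(row[0] * ri for row, ri in zip(J, r))     # g_h = J^T r
        gh1 = sum(row[1] * ri for row, ri in zip(J, r))
        lam = d * max(a, c)                                 # steps (3) and (4)
        det = (a + lam) * (c + lam) - b * b
        dp = -((c + lam) * gh0 - b * gh1) / det             # step (5), Equation (6.76)
        dq = -((a + lam) * gh1 - b * gh0) / det
        e_next = sse(p + dp, q + dq, xs, ts)                # step (6)
        if e_next < e_now:                                  # step (7): accept the step
            p, q, e_now = p + dp, q + dq, e_next
            d /= factor
        else:                                               # reject and damp more strongly
            d *= factor
    return p, q, e_now


xs = [0.1 * k for k in range(1, 11)]
ts = [model(x, 2.0, 1.5) for x in xs]           # synthetic targets with p = 2, q = 1.5
e_start = sse(1.0, 1.0, xs, ts)
p_fit, q_fit, e_fit = levenberg_marquardt(1.0, 1.0, xs, ts)
```

Because a step is accepted only when it lowers the objective, the error never
increases from one accepted point to the next. A practical routine would add a
stopping test on the size of $\delta$ or of the gradient instead of a fixed
iteration count.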

As discussed in solving Equation (6.70), the linear least-squares problem in (6.73)
can be considered. Likewise, instead of calculating $J^T J$, numerically stable
methods can be applied to solving Equation (6.76) or (6.77) in procedure (5). That is,
$\delta$ can be obtained as the solution of the following linear least-squares problem:

$$
\underset{\delta}{\text{minimize}}\;
\left\| \begin{bmatrix} r \\ 0 \end{bmatrix}
+ \begin{bmatrix} J \\ \sqrt{\lambda D} \end{bmatrix} \delta \right\|^2. \tag{6.80}
$$

This new Jacobian matrix
$\begin{bmatrix} J \\ \sqrt{\lambda D} \end{bmatrix}$
can be regarded as an expanded matrix of $J$ by assuming that $n$ training data are
added.


6.9 INCORPORATION OF STOCHASTIC MECHANISMS 

Virtually no gradient-based descent algorithm is guaranteed to find the global op- 
timum of a complex objective function within a finite period of time. All descent 
methods discussed so far are deterministic in the sense that they inevitably lead 
to convergence to the nearest local minimum. Figure 6.20 displays a typical behav- 
ior common to deterministic gradient-based descent methods, where Figure 6.20(a) 
illustrates a bimodal surface containing two minima, and Figure 6.20(b) shows that 
two negative or downhill gradient paths, starting from two close but different initial 
points (2,3) and (2.5,3), converge to different minima. Without further explana- 
tion, selecting initial positions for the deterministic methods clearly has a decisive 
effect on the final results. 

In practice, however, knowing good starting points is nearly impossible. If the
starting point is to be randomly selected, then it would be advisable to employ
a random method to perturb the final positions where the method converges and
begin the optimization process all over again. That is, the approach must somehow
include a stochastic element.

Figure 6.20. The deterministic descent methods are sensitive to initial conditions:
(a) a bimodal surface; (b) contour curves and two negative gradient paths starting
from two different points. (MATLAB file: gdconv.m)


Also, if calculating the gradient is time consuming or difficult due to the
complexity of the objective function, we should resort to stochastic or
derivative-free optimization methods, which are introduced in Chapter 7 (e.g.,
genetic algorithms, simulated annealing, the random search method, and the downhill
simplex method). The stochastic derivative-free methods require many function
evaluations in their attempt to descend toward minima without derivative
information. Thus, they usually require more computation than deterministic
derivative-based methods to reach a satisfactory level. Hence, it may be better to
approximate the gradient by evaluating a finite difference in a single step if the
gradient is not available. For those reasons, constructing an algorithm that combines
derivative information and stochastic nature is quite natural.
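One simple way to combine the two ingredients is sketched below: the gradient is
approximated by central finite differences, a fixed-step descent is run from
randomly drawn starting points, and the best result is kept. The bimodal
objective, the step size, the sampling range, and the number of restarts are all
assumptions chosen for illustration.

```python
import random


def objective(x, y):
    # Hypothetical bimodal function: global minimum near x = -1.04,
    # shallower local minimum near x = 0.96 (both at y = 0).
    return (x * x - 1.0) ** 2 + 0.3 * x + y * y


def fd_gradient(f, x, y, h=1e-6):
    """Central finite-difference approximation of the gradient of f."""
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))


def descend(f, x, y, eta=0.05, steps=500):
    """Simple steepest descent with a fixed step size, as in Equation (6.60)."""
    for _ in range(steps):
        gx, gy = fd_gradient(f, x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y


def random_restarts(f, restarts=20, seed=0):
    """Run the deterministic descent from random starts and keep the best point."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        x, y = descend(f, rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))
        if best is None or f(x, y) < f(*best):
            best = (x, y)
    return best


best = random_restarts(objective)
```

A single descent ends in whichever minimum its starting point drains to, whereas
the restarted version reports the lower of the minima it has visited. The price
is one full descent per restart.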

There are many possible algorithmic variations for fusing deterministic methods
and stochastic ones, depending on their goals. One possible goal is to guarantee
that the current point is a local minimum (i.e., neither a saddle point nor a local
maximum [45]). Another possible goal is to escape a local minimum in an attempt
to reach a global minimum. For those goals, many things must be considered. For
instance, some criterion should be set up to determine when to switch from a
deterministic method to a stochastic one, and vice versa. The timing of the switch
is important because a stochastic optimization scheme usually imposes a significant
computational expense.

Baba et al. proposed a hybrid algorithm that combines a conjugate gradient
method and a random search method [3]. Since the error surface of a neural network
(NN) may contain many small local minima, such unifying algorithms have attracted
much attention for NN learning [11, 48].

Furthermore, as long as squashing functions are employed in an NN architecture,
they act as implicit constraints because the net inputs to hidden nodes may
get driven to their limits (saturation). In this sense, NN learning does not fall
entirely under the unconstrained optimization that we have discussed so far. Even
with a sophisticated unconstrained optimization technique, NN learning may fail due
to saturation (implicit constraint).

6.10 SUMMARY 

In this chapter, we have addressed important aspects of widely employed gradient- 
based descent algorithms. The primary differences among them reside in selecting 
successive descent directions. Once the downhill direction is determined, all algo- 
rithms require a step down toward the minimum in the corresponding line. 

The steepest descent method is important for practical application. It can be
employed as a touchstone for discovering the intrinsic difficulty of a given task
and for establishing a certain reference for performance comparison. Particularly
when the task is a large and complex problem, it is worth trying because of its
simplicity. Also, we have presented a class of Newton methods. The usual practice
is to modify the pure Newton (or Newton-Raphson) method, since evaluating the
Hessian may not be worthwhile due to its heavy computational requirements. Many
promising algorithms can be regarded as a sort of intermediate between steepest
descent and Newton methods (e.g., conjugate gradient methods and
Levenberg-Marquardt methods, which are widely used and considered good
general-purpose algorithms for solving nonlinear least-squares problems).

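For comparison, we tabulate a compendium of several gradient-based descent
methods already discussed in this chapter:

Descent Method                            G
----------------------------------------  ------------------------------
Steepest Descent                          $I$
Newton-Raphson                            $(J^T J + S)^{-1}$
Newton-Raphson (+ Levenberg-Marquardt)    $(J^T J + S + \lambda I)^{-1}$
Gauss-Newton                              $(J^T J)^{-1}$
Gauss-Newton (+ Levenberg-Marquardt)      $(J^T J + \lambda I)^{-1}$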

They are compared in terms of matrix G in their general formula (6.9). The matrices 
J and S are components of the Hessian matrix, as defined in Equations (6.66) 
and (6.67). 

The described methods can be applied to neuro-fuzzy learning in subsequent 
chapters. Thus, we have explored several variations of useful algorithms in an 
attempt to find a better algorithmic variation for soft computing methodologies. 


EXERCISES 


1. Prove the “feasible descent directions” condition (6.7) by the mean value
theorem.

2. Prove that the gradient paths and contour curves of a continuous function
$E(\theta)$ are always perpendicular to each other.

3. In the Polak-Ribiere conjugate gradient algorithm [see Equation (6.50)], the
direction vectors are generated by

$$
\text{Polak-Ribiere:}\quad
d_k = -g_k + \frac{g_k^T (g_k - g_{k-1})}{g_{k-1}^T g_{k-1}}\, d_{k-1}.
$$

Verify that the $k$th direction vector ($d_k$) and the rest ($k - 1$) of the direction
vectors are mutually conjugate, by using Equation (6.38).

4. In light of Algorithm 6.1 for the initial bracketing scheme 1 in Section 6.5.1, 
sketch another algorithm for scheme 2. 

5. To clarify the concept of the Wolfe condition in Section 6.5.3, draw a figure of 
a Wolfe test, similar to Figure 6.10. 

6. Apply the classical Newton method to the simple quadratic problem discussed
in Section 6.7, and verify that the Newton method can locate the minimum
point in a single step. (Again, the Newton direction is theoretically scale
invariant.)

Chapter 7


Derivative-Free Optimization


J.-S. R. Jang 


7.1 INTRODUCTION 

This chapter introduces four of the most popular derivative-free optimization meth- 
ods: genetic algorithms, simulated annealing, random search method, and downhill 
simplex search. They have been used extensively for both continuous and discrete 
optimization problems. Common characteristics shared by these methods are de- 
scribed next. 

Derivative freeness These methods do not need functional derivative information 
to search for a set of parameters that minimize (or maximize) a given objec- 
tive function. Instead, they rely exclusively on repeated evaluations of the 
objective function, and the subsequent search direction after each evaluation 
follows certain heuristic guidelines. 

Intuitive guidelines The guidelines followed by these search procedures are usu- 
ally based on simple intuitive concepts. Some of these concepts are motivated 
by so-called nature’s wisdom, such as evolution and thermodynamics. 

Slowness Without using derivatives, these methods are bound to be generally 
slower than derivative-based optimization methods for continuous optimiza- 
tion problems. 

Flexibility Derivative freeness also relieves the requirement for differentiable ob- 
jective functions, so we can use as complex an objective function as a specific 
application might need, without sacrificing too much in extra coding and 
computation time. In some cases, an objective function can even include the 
structure of a data-fitting model itself, which may be a neural network or
fuzzy model. This means that by minimizing (or maximizing) a single objec- 
tive function of this type, we can do structure and parameter identification at 
the same time. 
Randomness All of these methods (with the probable exception of the standard
downhill simplex search) are stochastic, which means that they all use ran- 
dom number generators in determining subsequent search directions. This 
element of randomness usually gives rise to the overly optimistic view that 
these methods are “global optimizers” that will find a global optimum given 
enough computation time. In theory, their random nature does make the 
probability of finding an optimal solution nonzero over a fixed amount of 
computation time. In practice, however, it might take a considerable amount 
of computation time, if not forever, to find the optimal solution of a given 
problem. 

Analytic opacity It is difficult to do analytic studies of these methods, in part 
because of their randomness and problem-specific nature. Therefore, most of 
our knowledge about them is based on empirical studies. 

Iterative nature Unlike the linear least-squares estimator (Section 5.3), these 
techniques are iterative in nature and we need certain stopping criteria to 
determine when to terminate the optimization process. Let k denote an
iteration count and $f_k$ denote the best objective function value obtained at
count k; common stopping criteria for a maximization problem include the
following:

• Computation time: a designated amount of computation time, or number
of function evaluations and/or iteration counts, is reached.

• Optimization goal: $f_k$ is greater than a certain preset goal value.

• Minimal improvement: $f_k - f_{k-1}$ is less than a preset value.

• Minimal relative improvement: $(f_k - f_{k-1})/f_{k-1}$ is less than a preset
value.
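
The stopping criteria above can be combined into a single check. The following is a minimal Python sketch for a maximization problem; the function name `should_stop` and all parameter names are assumptions made for illustration, and the goal test is read here as "the best value has reached the goal."

```python
def should_stop(history, evals, max_evals, goal=None,
                min_improve=None, min_rel_improve=None):
    """Return the name of the first stopping criterion met, or None.

    history: best objective values f_1, ..., f_k found so far (maximization).
    evals:   number of function evaluations used so far.
    """
    if evals >= max_evals:
        return "computation budget"
    f_k = history[-1]
    if goal is not None and f_k >= goal:
        return "optimization goal"
    if len(history) >= 2:
        f_prev = history[-2]
        if min_improve is not None and f_k - f_prev < min_improve:
            return "minimal improvement"
        # Assumption beyond the text: divide by |f_{k-1}| and skip the test
        # when f_{k-1} is zero, so the ratio stays well defined.
        if (min_rel_improve is not None and f_prev != 0
                and (f_k - f_prev) / abs(f_prev) < min_rel_improve):
            return "minimal relative improvement"
    return None
```

In practice the improvement tests are often applied over a window of several iterations rather than a single step, since stochastic methods can stall briefly before improving again.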

Both genetic algorithms (GAs) and simulated annealing (SA) have been receiv- 
ing increasing amounts of attention due to their versatile optimization capabilities 
for both continuous and discrete optimization problems. Moreover, both of them 
are motivated by so-called nature’s wisdom: GAs are loosely based on the concepts
of natural selection and evolution, while SA originated in the annealing processes
found in thermodynamics and metallurgy. 

Random search and downhill simplex search are primarily for continuous opti- 
mization problems. Random search is the simplest and most intuitive optimization 
scheme of all; its implementation takes only a few lines in the MATLAB program
randsrch.m, which is available via FTP or WWW (see page xxiii). Downhill
simplex search is based on heuristic adaptation of a geometric object (a
simplex) to explore a performance landscape efficiently. The file fmins.m shipped
with MATLAB is an implementation of this optimization method.

All these techniques are often described as “weak” methods by the artificial 
intelligence community because of their relatively few assumptions about the prob- 
lems being solved. Fortunately, due to their high degree of flexibility, it is easy 
to increase their efficiency by incorporating problem-specific heuristics into these
methods. 

Although the concepts and implementations of random search and downhill 
simplex search are simpler than those of GAs and SA, it should not be inferred that
GAs and SA are better for all problems all the time. In general, one should not 
expect any single technique to outperform all the others in a given application. 

In what follows, we shall describe the basics of these derivative-free optimization 
methods. The characteristics of each method will be explained in sufficient detail 
that the reader can decide which method is better suited for his or her needs. 

7.2 GENETIC ALGORITHMS 

Genetic algorithms (GAs) [4, 7] are derivative-free stochastic optimization meth- 
ods based loosely on the concepts of natural selection and evolutionary processes. 
They were first proposed and investigated by John Holland at the University of 
Michigan in 1975 [7]. As a general-purpose optimization tool, GAs are moving out 
of academia and finding significant applications in many other venues. Their popu- 
larity can be attributed to their freedom from dependence on functional derivatives 
and to their incorporation of these characteristics: 

• GAs are parallel-search procedures that can be implemented on parallel- 
processing machines for massively speeding up their operations. 

• GAs are applicable to both continuous and discrete (combinatorial) optimiza- 
tion problems. 

• GAs are stochastic and less likely to get trapped in local minima, which in- 
evitably are present in any practical optimization application. 

• GAs’ flexibility facilitates both structure and parameter identification in com- 
plex models such as neural networks and fuzzy inference systems. 

GAs encode each point in a parameter (or solution) space into a binary bit 
string called a chromosome, and each point is associated with a “fitness” value 
that, for maximization, is usually equal to the objective function evaluated at the 
point. Instead of a single point, GAs usually keep a set of points as a population 
(or gene pool), which is then evolved repeatedly toward a better overall fitness 
value. In each generation, the GA constructs a new population using genetic 
operators such as crossover and mutation; members with higher fitness values are 
more likely to survive and to participate in mating (crossover) operations. After 
a number of generations, the population contains members with better fitness val- 
ues; this is analogous to Darwinian models of evolution by random mutation and 
natural selection. GAs and their variants are sometimes referred to as methods of 
population-based optimization that improve performance by upgrading entire 
populations rather than individual members. 
Major components of GAs include encoding schemes, fitness evaluations, parent
selection, crossover operators, and mutation operators; these are explained next.

Encoding schemes These transform points in parameter space into bit string rep- 
resentations. For instance, a point (11, 6, 9) in a three-dimensional parameter 
space can be represented as a concatenated binary string: 

    1011 0110 1001
     11    6    9

in which each coordinate value is encoded as a gene composed of four binary 
bits using binary coding. Other encoding schemes, such as Gray coding, can 
also be used and, when necessary, arrangements can be made for encoding 
negative, floating-point, or discrete-valued numbers. Encoding schemes pro- 
vide a way of translating problem-specific knowledge directly into the GA 
framework, and thus play a key role in determining GAs’ performance. More- 
over, genetic operators, such as crossover and mutation, can and should be 
designed along with the encoding scheme used for a specific application. 
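
The four-bit binary coding of the point (11, 6, 9) can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `encode` and `decode` are assumptions, and the coding handles nonnegative integers only.

```python
def encode(point, bits_per_gene=4):
    """Concatenate the fixed-width binary codes of nonnegative integer coordinates."""
    for value in point:
        if not 0 <= value < 2 ** bits_per_gene:
            raise ValueError(f"{value} does not fit in {bits_per_gene} bits")
    return "".join(format(value, f"0{bits_per_gene}b") for value in point)


def decode(chromosome, bits_per_gene=4):
    """Split a chromosome into genes and convert each gene back to an integer."""
    genes = [chromosome[i:i + bits_per_gene]
             for i in range(0, len(chromosome), bits_per_gene)]
    return tuple(int(gene, 2) for gene in genes)


chromosome = encode((11, 6, 9))
print(chromosome)          # 101101101001
print(decode(chromosome))  # (11, 6, 9)
```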

Fitness evaluation The first step after creating a generation is to calculate the 
fitness value of each member in the population. For a maximization problem,
the fitness value $f_i$ of the ith member is usually the objective function
evaluated at this member (or point). We usually need fitness values that are
positive, so some kind of monotonic scaling and/or translation may be
necessary if the objective function is not strictly positive. Another approach is
to use the rankings of members in a population as their fitness values. The 
advantage of this is that the objective function does not need to be accurate, 
as long as it can provide the correct ranking information. 

Selection After evaluation, we have to create a new population from the current 
generation. The selection operation determines which parents participate in 
producing offspring for the next generation, and it is analogous to survival of 
the fittest in natural selection. Usually members are selected for mating with 
a selection probability proportional to their fitness values. The most common 
way to implement this is to set the selection probability equal to $f_i / \sum_{k=1}^{n} f_k$,
where n is the population size. The effect of this selection method is to allow 
members with above-average fitness values to reproduce and replace members 
with below-average fitness values. 
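
Fitness-proportional selection, often called roulette-wheel selection, can be sketched as follows. The function names are assumptions for illustration, and the uniform fallback for an all-zero fitness vector is a design choice that the text does not specify.

```python
import random


def selection_probabilities(fitness):
    """Return p_i = f_i / sum_k f_k for nonnegative fitness values."""
    total = sum(fitness)
    if total <= 0:
        # Degenerate case (all zeros): fall back to a uniform choice.
        return [1.0 / len(fitness)] * len(fitness)
    return [f / total for f in fitness]


def select_parent(population, fitness, rng=random):
    """Pick one member with probability proportional to its fitness."""
    probs = selection_probabilities(fitness)
    return rng.choices(population, weights=probs, k=1)[0]
```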

Crossover To exploit the potential of the current gene pool, we use crossover op- 
erators to generate new chromosomes that we hope will retain good features 
from the previous generation. Crossover is usually applied to selected pairs 
of parents with a probability equal to a given crossover rate. One-point 
crossover is the most basic crossover operator, where a crossover point on 
the genetic code is selected at random and two parent chromosomes are inter- 
changed at this point. In two-point crossover, two crossover points are se- 
lected and the part of the chromosome string between these two points is then 
swapped to generate two children. We can define n-point crossover similarly.
In general, (n - 1)-point crossover is a special case of n-point crossover.
Examples of one- and two-point crossover are shown in Figures 7.1(a) and 7.1(b),
respectively.

Figure 7.1. Crossover operators: (a) one-point crossover; (b) two-point crossover.

The effect of crossover is similar to that of mating in the natural evolutionary 
process, in which parents pass segments of their own chromosomes on to their 
children. Therefore, some children are able to outperform their parents if they 
get “good” genes or genetic traits from both parents. 
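
One-point and two-point crossover on bit strings can be sketched as below. The function names are hypothetical, and chromosomes are assumed to be equal-length Python strings of 0s and 1s.

```python
import random


def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length bit strings after position `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def two_point_crossover(parent_a, parent_b, left, right):
    """Swap the segment [left, right) between two equal-length bit strings."""
    child_a = parent_a[:left] + parent_b[left:right] + parent_a[right:]
    child_b = parent_b[:left] + parent_a[left:right] + parent_b[right:]
    return child_a, child_b


def random_one_point_crossover(parent_a, parent_b, crossover_rate, rng=random):
    """Apply one-point crossover with probability `crossover_rate`."""
    if rng.random() < crossover_rate:
        # Choose the cut so that each child keeps at least one bit per parent.
        point = rng.randint(1, len(parent_a) - 1)
        return one_point_crossover(parent_a, parent_b, point)
    return parent_a, parent_b
```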

Mutation Crossover exploits current gene potentials, but if the population does 
not contain all the encoded information needed to solve a particular problem, 
no amount of gene mixing can produce a satisfactory solution. For this reason, 
a mutation operator capable of spontaneously generating new chromosomes 
is included. The most common way of implementing mutation is to flip a bit 
with a probability equal to a very low given mutation rate. A mutation 
operator can prevent any single bit from converging to a value throughout the 
entire population and, more important, it can prevent the population from 
converging and stagnating at any local optima. The mutation rate is usually 
kept low so good chromosomes obtained from crossover are not lost. If the 
mutation rate is high (above 0.1), GA performance will approach that of a 
primitive random search. Figure 7.2 provides an example of mutation. 
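
Bit-flip mutation takes only a few lines. This is a sketch under the assumption that chromosomes are Python strings of 0s and 1s; the name `mutate` is illustrative.

```python
import random


def mutate(chromosome, mutation_rate, rng=random):
    """Flip each bit of a bit string independently with probability `mutation_rate`."""
    flipped = {"0": "1", "1": "0"}
    return "".join(
        flipped[bit] if rng.random() < mutation_rate else bit
        for bit in chromosome
    )
```

With a rate of 0.01 and 16-bit chromosomes, roughly one chromosome in six is expected to change in any given generation.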

In the natural evolutionary process, selection, crossover, and mutation all occur 
in the single act of generating offspring. Here we distinguish among them clearly 
to facilitate implementation of and experimentation with GAs. 

Note that this section only gives a general description of the basics of GAs; 
detailed implementations vary considerably. For instance, we may choose a policy 
of always keeping a certain number of best members when each new population is
generated; this principle is usually called elitism.

Figure 7.3. Producing the next generation in GAs.

Based on the aforementioned concepts, a simple genetic algorithm for maximiza- 
tion problems is described next. 

Step 1: Initialize a population with randomly generated individuals and evaluate 
the fitness value of each individual. 

Step 2: 

(a) Select two members from the population with probabilities proportional 
to their fitness values. 

(b) Apply crossover with a probability equal to the crossover rate. 

(c) Apply mutation with a probability equal to the mutation rate. 

(d) Repeat (a) to (c) until enough members are generated to form the next
generation.

Step 3: Repeat step 2 until a stopping criterion is met.

Figure 7.3 is a schematic diagram illustrating how to produce the next generation 
from the current one. 
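
The steps of the simple genetic algorithm can be put together in a short Python sketch. It is not the book's MATLAB implementation: the function name `simple_ga`, its parameters, the fixed number of generations used as the stopping criterion, and the toy objective (counting 1 bits) are all assumptions for illustration. The sketch keeps a record of the best chromosome seen but does not apply elitism.

```python
import random


def simple_ga(fitness_fn, n_bits, pop_size=20, crossover_rate=1.0,
              mutation_rate=0.01, generations=30, seed=0):
    """Maximize fitness_fn over bit strings; fitness values must be nonnegative."""
    rng = random.Random(seed)
    # Step 1: random initial population.
    population = ["".join(rng.choice("01") for _ in range(n_bits))
                  for _ in range(pop_size)]
    best = max(population, key=fitness_fn)
    for _ in range(generations):              # Step 3: repeat step 2
        fitness = [fitness_fn(c) for c in population]
        weights = fitness if sum(fitness) > 0 else None  # uniform if all zero
        next_population = []
        while len(next_population) < pop_size:
            # Step 2(a): fitness-proportional selection of two parents.
            a, b = rng.choices(population, weights=weights, k=2)
            # Step 2(b): one-point crossover with probability crossover_rate.
            if rng.random() < crossover_rate:
                point = rng.randint(1, n_bits - 1)
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            # Step 2(c): flip each bit with probability mutation_rate.
            for child in (a, b):
                child = "".join(
                    ("1" if bit == "0" else "0")
                    if rng.random() < mutation_rate else bit
                    for bit in child)
                next_population.append(child)
        population = next_population[:pop_size]
        best = max(population + [best], key=fitness_fn)  # best seen so far
    return best, fitness_fn(best)


# Toy objective (an assumption for illustration): count the 1 bits.
best, value = simple_ga(lambda c: c.count("1"), n_bits=16)
print(best, value)
```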

Example 7.1 Maximization of the “peaks” function using GAs 
Figure 7.4. The “peaks” function. (MATLAB file: peaks.m)


The “peaks” function is a two-input function defined as 
$$z = f(x, y) = 3(1 - x)^2 e^{-x^2 - (y+1)^2} - 10\left(\frac{x}{5} - x^3 - y^5\right) e^{-x^2 - y^2} - \frac{1}{3} e^{-(x+1)^2 - y^2}.$$

The surface plot of this function, shown in Figure 7.4, can be obtained directly by 
typing peaks within MATLAB. (Here we have changed the color map to “gray” 
such that a patch’s brightness is proportional to its height.) 

To use GAs to find the maximum of this function, we first confine the search 
domain to be a square area of $[-3, 3] \times [-3, 3]$. We use 8-bit binary coding for each
variable, which results in a search space size of $2^8 \times 2^8 = 65{,}536$. Each generation
in our GA implementation contains 20 points or individuals. Each point’s fitness 
is defined as the value of the “peaks” function minus the minimum function value 
across the population. This guarantees that all fitness values are nonnegative. We 
use a simple one-point crossover scheme with the crossover rate equal to 1.0, which 
means that we always do crossover on selected parents. We choose uniform muta- 
tion (that is, each bit has the same probability of mutation) with the mutation rate 
equal to 0.01. We also apply elitism to keep the best two individuals across gen- 
erations. Figure 7.5(a) is the contour plot of the “peaks” function, with the initial 
population locations denoted by circles. After the fifth generation, the population 
starts to converge to the peak containing the maximum, as shown in Figure 7.5(b). 
Figure 7.5(c) is the population distribution after the tenth generation; the individual 
that is far away from the main cluster is the result of the mutation operator. 
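
The ingredients of this example can be sketched in Python. The `peaks` function below follows the formula used by MATLAB's peaks; the helper names `decode_8bit` and `shifted_fitness`, and the choice of mapping the 256 codes to evenly spaced values that include both endpoints of [-3, 3], are assumptions for illustration rather than details taken from go_ga.m.

```python
import math


def peaks(x, y):
    """Two-input "peaks" function, following the formula used by MATLAB's peaks."""
    return (3.0 * (1.0 - x) ** 2 * math.exp(-x ** 2 - (y + 1.0) ** 2)
            - 10.0 * (x / 5.0 - x ** 3 - y ** 5) * math.exp(-x ** 2 - y ** 2)
            - math.exp(-(x + 1.0) ** 2 - y ** 2) / 3.0)


def decode_8bit(gene, lo=-3.0, hi=3.0):
    """Map an 8-bit string to one of 256 evenly spaced values in [lo, hi]."""
    return lo + (hi - lo) * int(gene, 2) / 255.0


def shifted_fitness(values):
    """Subtract the population minimum so that all fitness values are nonnegative."""
    lowest = min(values)
    return [v - lowest for v in values]
```

A 16-bit chromosome is then split into two 8-bit genes, each gene is decoded to a coordinate, and the shifted function values serve as fitness values for selection.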

Figure 7.6 is a plot of the best, average, and poorest values of the objective 
function across 30 generations. Since we are using elitism to keep the best two 
individuals at each generation, the “best” curve is monotonically nondecreasing
with respect to generation numbers. The erratic behavior of the “poorest” curve is
due to the mutation operator, which explores the landscape in a somewhat random
manner.

Figure 7.5. Using GAs to find the maximum of the “peaks” function: (a) initial
population; (b) population after the fifth generation; (c) population after the tenth
generation. (MATLAB file: go_ga.m)

Figure 7.6. Performance of GAs across generations. (MATLAB file: go_ga.m)

□


7.3 SIMULATED ANNEALING 

Simulated annealing (SA) [10, 15] is another derivative-free optimization method 
that has recently drawn much attention for being as suitable for continuous as for 
discrete (combinatorial) optimization problems. When SA was first proposed [10], 
it was mostly known for its effectiveness in finding near optimal solutions for large- 
scale combinatorial optimization problems, such as traveling salesperson problems 
(finding the shortest cyclical itinerary for a salesperson who must visit each of N 
cities in turn) and placement problems [18] (finding the layout of a computer chip 
that minimizes the total area). Recent applications of SA and its variants [9] also 
demonstrate that this class of optimization approaches can be considered competi- 
tive with other approaches when there are continuous optimization problems to be 
solved. 

Simulated annealing was derived from physical characteristics of spin glasses 
[10, 13]. The principle behind simulated annealing is analogous to what happens 
when metals are cooled at a controlled rate. The slowly falling temperature allows 
the atoms in the molten metal to line themselves up and form a regular crystalline 
structure that has high density and low energy. But if the temperature goes down 
too quickly, the atoms do not have time to orient themselves into a regular structure 
and the result is a more amorphous material with higher energy. 

In simulated annealing, the value of an objective function that we want to mini- 
mize is analogous to the energy in a thermodynamic system. At high temperatures, 
SA allows function evaluations at faraway points and it is likely to accept a new 
point with higher energy. This corresponds to the situation in which high-mobility 
atoms are trying to orient themselves with other nonlocal atoms and the energy 
state can occasionally go up. At low temperatures, SA evaluates the objective func- 
tion only at local points and the likelihood of it accepting a new point with higher 
energy is much lower. This is analogous to the situation in which the low-mobility 
atoms can only orient themselves with local atoms and the energy state is not likely 
to go up again. 

Obviously, the most important part of SA is the so-called annealing sched- 
ule or cooling schedule, which specifies how rapidly the temperature is lowered 
from high to low values. This is usually application specific and requires some 
experimentation by trial-and-error. 

Before giving a detailed description of SA, first we shall explain the fundamental 
terminology of SA. 

Objective function An objective function $f(\cdot)$ maps an input vector x into a
scalar E:

$$E = f(x),$$

where each x is viewed as a point in an input space. The task of SA is to 
sample the input space effectively to find an x that minimizes E. 
Generating function A generating function $g(\cdot, \cdot)$ specifies the probability
density function of the difference between the current point and the next point
to be visited. Specifically, $\Delta x$ (= $x_{\text{new}} - x$) is a random variable with
probability density function $g(\Delta x, T)$, where T is the temperature. For common
SA (especially when used in combinatorial optimization applications), $g(\cdot, \cdot)$
is usually a function independent of the temperature T.

Acceptance function After a new point $x_{\text{new}}$ has been evaluated, SA decides
whether to accept or reject it based on the value of an acceptance function
$h(\cdot, \cdot)$. The most frequently used acceptance function is the Boltzmann
probability distribution:

$$h(\Delta E, T) = \frac{1}{1 + \exp(\Delta E/(cT))},$$

where c is a system-dependent constant, T is the temperature, and $\Delta E$ is the
energy difference between $x_{\text{new}}$ and x:

$$\Delta E = f(x_{\text{new}}) - f(x).$$

The common practice is to accept $x_{\text{new}}$ with probability $h(\Delta E, T)$. Note that
when $\Delta E$ is negative, SA tends to accept the new point because it reduces
the energy. When $\Delta E$ is positive, SA may accept the new point and end up
in a higher energy state. In other words, SA can go either uphill or downhill;
but the lower the temperature, the less likely SA is to accept any significant
uphill actions.
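
The behavior of the acceptance function is easy to check numerically. The sketch below implements the Boltzmann-type form 1 / (1 + exp(ΔE / (cT))); the function name and the overflow guard are assumptions added for illustration.

```python
import math


def acceptance_probability(delta_e, temperature, c=1.0):
    """Boltzmann-type acceptance h(dE, T) = 1 / (1 + exp(dE / (c T)))."""
    z = delta_e / (c * temperature)
    # Guard against overflow in exp() for very large positive arguments.
    if z > 700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(z))
```

At ΔE = 0 the probability is exactly 0.5. For a fixed positive ΔE, the probability shrinks toward 0 as T decreases, which is the "less likely to go uphill at low temperature" behavior described above.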

Annealing schedule An annealing schedule regulates how rapidly the tempera- 
ture T goes from high to low values, as a function of time or iteration counts. 
The exact interpretation of high and low and the specification of a good 
annealing schedule require certain problem-specific physical insights and/or 
trial-and-error. The easiest way of setting an annealing schedule is to de- 
crease the temperature T by a certain percentage at each iteration. 

Having presented this brief guide to the SA terminology, we now describe the
basic steps involved in a general SA method.

Step 1: Choose a start point x and set a high starting temperature T. Set the
iteration count k to 1.

Step 2: Evaluate the objective function:

$$E = f(x).$$

Step 3: Select $\Delta x$ with probability determined by the generating function
$g(\Delta x, T)$. Set the new point $x_{\text{new}}$ equal to $x + \Delta x$.





Step 4: Calculate the new value of the objective function:

$$E_{\text{new}} = f(x_{\text{new}}).$$

Step 5: Set x to $x_{\text{new}}$ and E to $E_{\text{new}}$ with probability determined by the
acceptance function $h(\Delta E, T)$, where $\Delta E = E_{\text{new}} - E$.

Step 6: Reduce the temperature T according to the annealing schedule (usually
by simply setting T equal to $\eta T$, where $\eta$ is a constant between 0 and 1).

Step 7: Increment iteration count k. If k reaches the maximum iteration count, 
stop the iterating. Otherwise, go back to step 3. 
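
The seven steps translate almost line by line into code. The following Python sketch makes several assumptions that the steps leave open: the perturbation is drawn from a zero-mean Gaussian with variance T in each coordinate, the schedule is geometric (T is multiplied by `eta` each iteration), and the best point seen is tracked in addition to the current point. The function and parameter names are illustrative.

```python
import math
import random


def simulated_annealing(f, x0, t0=1.0, eta=0.95, c=1.0, max_iter=1000, seed=0):
    """Minimize f from x0; return the final point and energy, and the best seen."""
    rng = random.Random(seed)
    x, temperature = list(x0), t0            # Step 1
    energy = f(x)                            # Step 2
    best_x, best_e = list(x), energy
    for k in range(1, max_iter + 1):         # Step 7: iteration count
        # Step 3: draw dx from a zero-mean Gaussian whose variance is T.
        sigma = math.sqrt(temperature)
        x_new = [xi + rng.gauss(0.0, sigma) for xi in x]
        e_new = f(x_new)                     # Step 4
        # Step 5: accept with probability 1 / (1 + exp(dE / (c T))).
        z = (e_new - energy) / (c * temperature)
        h = 0.0 if z > 700.0 else 1.0 / (1.0 + math.exp(z))
        if rng.random() < h:
            x, energy = x_new, e_new
            if energy < best_e:
                best_x, best_e = list(x), energy
        temperature *= eta                   # Step 6: T <- eta * T
    return x, energy, best_x, best_e
```

With the default settings the temperature stays positive throughout; for much longer runs a lower bound on T would be needed to avoid dividing by a value that has underflowed to zero.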

In conventional SA, also known as Boltzmann machines [5, 6], the generating 
function is a Gaussian probability density function: 

$$g(\Delta x, T) = (2\pi T)^{-n/2} \exp\left[-\|\Delta x\|^2/(2T)\right],$$

where $\Delta x$ (= $x_{\text{new}} - x$) is the deviation of the new point from the current one, T is
the temperature, and n is the dimension of the space under exploration. It has been
proven in ref. [3] that a Boltzmann machine using the aforementioned generating
function $g(\cdot, \cdot)$ can find a global optimum of $f(x)$ if the temperature T is reduced
no faster than $T_0/\ln k$.

For discrete or combinatorial optimization problems, each x is not necessarily an 
n-element vector with unconstrained values. Instead, each x is confined to be one 
of N points that constitute the solution space or input space. Usually N is finite, 
but it is very large such that the use of an exhaustive search method is impossible. 
Thus, adding randomly generated Ax to a current point x may not generate another 
legal point in the solution space, so generating functions are rarely used. Instead, 
to find the next legal point to explore, we usually define a move set, denoted by 
M(x), as the set of legal points available for exploration after x. Usually the move
set M(x) represents a set of neighboring points of the current point x in the sense
that the objective function at any point of the move set will not differ too much
from the objective function at x. The definition of a move set is problem dependent
and reflects our knowledge about the problem under consideration. Once the move
set is defined, $x_{\text{new}}$ is usually selected at random from the move set, where every
member has an equal probability of being chosen. A well-known instance of
combinatorial optimization is the traveling salesperson problem, which is tackled 
using the SA technique in the following example. 

Example 7.2 Traveling salesperson problem 

In a typical traveling salesperson problem (TSP), we are given n cities, and
the distance (or cost) between all pairs of these cities is given by an $n \times n$ distance
(or cost) matrix D, where the element $d_{ij}$ represents the distance (or cost) of
traveling from city i to city j. The problem is to find a closed tour in which each
city, except for the starting one, is visited exactly once, such that the total length
(cost) is minimized. The traveling salesperson problem is a well-known problem in
combinatorial optimization; it belongs to a class of problems known as
NP-complete¹ [16], for which the computation time required by known exact
methods to find an optimal solution increases exponentially with n. For a TSP
with n cities, the number of possible tours is $(n - 1)!/2$, which becomes
prohibitively large even for a moderate n. For instance, finding the best tour of
the state capitals of the United States (n = 50) by exhaustive enumeration would
require many billions of years even with the fastest modern computers.

For a common traveling salesperson problem, we can define at least three move 
sets for SA: 

Inversion Remove two edges from the tour and replace them to make it another 
legal tour. This is equivalent to removing a section (6-7-8-9) of the tour 
and then replacing with the same cities running in the opposite order. See 
Figures 7.7(a) and 7.7(b). 

Translation Remove a section (8-7) of the tour and then replace it in between two 
randomly selected consecutive cities (4 and 5). See Figures 7.7(b) and 7.7(c). 

Switching Randomly select two cities (3 and 11) and switch them in a tour. See
Figures 7.7(c) and 7.7(d).
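
The three move sets can be written as small list operations on a tour. In this sketch a tour is a Python list of city labels, positions are zero-based indices into that list, and all function names are assumptions for illustration (they are not taken from tsp.m).

```python
def inversion(tour, i, j):
    """Reverse the section tour[i..j]; this replaces two edges of the tour."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


def translation(tour, i, j, k):
    """Remove the section tour[i..j] and reinsert it after position k of the rest."""
    section = tour[i:j + 1]
    rest = tour[:i] + tour[j + 1:]
    return rest[:k + 1] + section + rest[k + 1:]


def switching(tour, i, j):
    """Exchange the cities at positions i and j."""
    new_tour = list(tour)
    new_tour[i], new_tour[j] = new_tour[j], new_tour[i]
    return new_tour


def tour_length(tour, dist):
    """Total length of the closed tour under the distance matrix dist."""
    return sum(dist[tour[p]][tour[(p + 1) % len(tour)]]
               for p in range(len(tour)))
```

For example, `inversion(list(range(1, 13)), 5, 8)` reverses the section 6-7-8-9 of a 12-city tour. In an SA loop, a candidate produced by one of these moves is accepted or rejected according to the change in `tour_length`.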

Generally speaking, the switching move set tends to rupture the original tour 
and results in a tour that has a total length (or cost) significantly different from 
that of the original tour. Comparisons between the inversion and switching move
sets can be found in ref. [15].

We apply the SA technique with the inversion move set to a TSP with 100 cities; 
Figure 7.8 is a typical result after 10 minutes of simulation on a SUN SPARC II 
workstation. The MATLAB file is tsp.m. (Note that the function tsp.m is not fully 
optimized for speed yet; it only serves to demonstrate the important aspects of SA 
techniques.) 


□ 

Variants of Boltzmann machines include the Cauchy machine or fast simu- 
lated annealing [21, 22], where the generating function is the Cauchy distribution: 

$$g(\Delta x, T) = \frac{T}{(\|\Delta x\|^2 + T^2)^{(n+1)/2}}.$$

The fatter tail of the Cauchy distribution allows it to explore farther from the 
current point during the search process. 

Another variant of the original SA, the so-called very fast simulated
reannealing (VFSR) [8], was designed for optimization problems in a constrained
search space. For a parameter $x_i(k)$ in dimension i at annealing time k, the new
value is generated by

$$x_i(k+1) = x_i(k) + \lambda_i (x_i^{\max} - x_i^{\min}),$$

where $\lambda_i \in [-1, 1]$, and $x_i^{\max}$ and $x_i^{\min}$ are the maximum and minimum of the ith
dimension. This is repeated until a legal $x_i$ between $x_i^{\min}$ and $x_i^{\max}$ is generated.
The generating function for $\lambda_i$ is

$$g_i(\lambda_i, T_i) = \frac{1}{2(|\lambda_i| + T_i)\ln(1 + 1/T_i)}.$$

To generate $\lambda_i$ according to the preceding distribution, we can apply the following
formula:

$$\lambda_i = \mathrm{sgn}(u_i - 0.5)\, T_i \left[(1 + 1/T_i)^{|2u_i - 1|} - 1\right],$$

where $u_i$ is a uniformly distributed random variable between 0 and 1. A global
optimum can be obtained statistically if the annealing schedule is

$$T_i(k) = T_i(0)\exp(-c_i k^{1/n}),$$

where $c_i$ is a user-defined parameter whose value should be selected according to
the guidelines in ref. [9]. The same type of annealing schedule should be used for
both the generating function $g(\cdot, \cdot)$ and the acceptance function $h(\cdot, \cdot)$.

¹ NP stands for nondeterministic polynomial time.

Figure 7.7. Three operations for generating move sets in the traveling salesperson
problem.

Figure 7.8. The result of a 100-city traveling salesperson problem using the
inversion move set in simulated annealing. (MATLAB file: tsp.m)
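
The VFSR generating step and annealing schedule can be sketched directly from the formulas above. The function names are hypothetical, and the rejection loop simply redraws until the new value is legal, as the text describes.

```python
import math
import random


def vfsr_step(u, temperature):
    """Map u in [0, 1) to sgn(u - 0.5) T [(1 + 1/T)^|2u - 1| - 1]."""
    sign = 1.0 if u >= 0.5 else -1.0
    return sign * temperature * (
        (1.0 + 1.0 / temperature) ** abs(2.0 * u - 1.0) - 1.0)


def vfsr_new_value(x, x_min, x_max, temperature, rng=random):
    """Redraw until x + step * (x_max - x_min) falls inside [x_min, x_max]."""
    while True:
        candidate = x + vfsr_step(rng.random(), temperature) * (x_max - x_min)
        if x_min <= candidate <= x_max:
            return candidate


def vfsr_temperature(t0, c, k, n):
    """Annealing schedule T(k) = T(0) exp(-c k^(1/n))."""
    return t0 * math.exp(-c * k ** (1.0 / n))
```

The step equals 0 at u = 0.5 and approaches -1 or +1 as u approaches 0 or 1, so every draw lies in [-1, 1]; a lower temperature concentrates the draws near 0.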

Reannealing or temperature rescaling in the VFSR algorithm periodically
rescales the generating temperature in terms of the sensitivities $s_i$ calculated at the
most current minimum value of the cost function $E^*$:

$$s_i = |\partial E^*/\partial x_i|.$$

The annealing time $k_i$ is adjusted according to $s_i$, based on the heuristic concept
that the generating distribution used in a relatively insensitive dimension should
be wider than that of the distribution produced in a dimension more sensitive to
change. A detailed discussion of the reannealing process can be found in ref. [9].

VFSR was reported to be faster than genetic algorithms on several test problems;
see ref. [9] for more detail.

7.4 RANDOM SEARCH 

Random search [2, 11, 12, 20] explores the parameter space of an objective func- 
tion sequentially in a seemingly random fashion to find the optimal point that 
minimizes (or maximizes) the objective function. Besides being derivative free, the
most distinguishing strength of the random search method lies in its simplicity, 
which makes the method easily understood and conveniently customized for spe- 
cific applications. Moreover, it has been proved that this method converges to the 
global optimum with probability 1 on a compact set [1, 20]. However, the theoret- 
ical result of convergence to the global optimum is not really important here since 
the optimization process itself could take a prohibitively long time. 

Here we shall start with the most primitive version proposed by Matyas [11]. 
Following some heuristic guidelines, we shall also present a modified version that is 
more efficient. 

Let $f(x)$ be the objective function to be minimized and x be the point currently
under consideration. The original random search method [11] tries to find the 
optimal x by iterating the following four steps: 

Step 1: Choose a start point x as the current point. 

Step 2: Add a random vector dx to the current point x in the parameter space
and evaluate the objective function at the new point x + dx.

Step 3: If $f(x + dx) < f(x)$, set the current point x equal to x + dx.

Step 4: Stop if the maximum number of function evaluations is reached. Other- 
wise, go back to step 2 to find a new point. 
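
The four steps of the primitive method can be sketched as follows. The Gaussian step with a fixed standard deviation `sigma` is an assumption, since the steps above do not specify the distribution of dx; the function name is also illustrative.

```python
import random


def random_search(f, x0, sigma=1.0, max_evals=1000, seed=0):
    """Primitive random search: keep a random step only if it lowers f."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)                  # Step 1
    evals = 1
    while evals < max_evals:                 # Step 4
        dx = [rng.gauss(0.0, sigma) for _ in x]            # Step 2
        candidate = [xi + di for xi, di in zip(x, dx)]
        f_candidate = f(candidate)
        evals += 1
        if f_candidate < fx:                 # Step 3
            x, fx = candidate, f_candidate
    return x, fx
```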

This is a truly random method in the sense that search directions are purely 
guided by a random number generator. There are several ways to improve this 
primitive version; these are based on the following observations: 

Observation 1: If search in a direction results in a higher objective function, the 
opposite direction can often lead to a lower objective function. 

Observation 2: Successive successful searches in a certain direction should bias 
subsequent searching toward this direction. On the other hand, successive 
failures in a certain direction should discourage subsequent searching along 
this direction. 

The first observation leads to a reverse step in the original method. The second 
observation motivates the use of a bias term as the center for the random vector. 
After including these two guidelines, the modified random search method [20] in- 
volves the following six steps: 

Step 1: Choose a start point x as the current point. Set initial bias b equal to a 
zero vector. 

Step 2: Add a bias term b and a random vector dx to the current point x in the
input space and evaluate the objective function at the new point x + b + dx.

Step 3: If $f(x + b + dx) < f(x)$, set the current point x equal to x + b + dx and
the bias b equal to 0.2b + 0.4dx; go to step 6. Otherwise, go to the next step.

Step 4: If $f(x + b - dx) < f(x)$, set the current point x equal to x + b - dx and
the bias b equal to b - 0.4dx; go to step 6. Otherwise, go to the next step.

Step 5: Set the bias equal to 0.5b and go to step 6.

Step 6: Stop if the maximum number of function evaluations is reached. Otherwise
go back to step 2 to find a new point.

Figure 7.9. Flow chart for the random search method.
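
The six steps of the modified method can be sketched in Python as below. This is not the randsrch.m file: the function name, the Gaussian step with a fixed standard deviation `sigma`, and the evaluation counting are assumptions for illustration.

```python
import random


def modified_random_search(f, x0, sigma=1.0, max_evals=1000, seed=0):
    """Random search with a reverse step and a bias term (minimization)."""
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    b = [0.0] * len(x)                       # Step 1: zero initial bias
    evals = 1
    while evals < max_evals:                 # Step 6
        dx = [rng.gauss(0.0, sigma) for _ in x]
        # Step 2: try the forward point x + b + dx.
        forward = [xi + bi + di for xi, bi, di in zip(x, b, dx)]
        f_forward = f(forward)
        evals += 1
        if f_forward < fx:                   # Step 3
            x, fx = forward, f_forward
            b = [0.2 * bi + 0.4 * di for bi, di in zip(b, dx)]
            continue
        # Step 4: try the reverse point x + b - dx.
        backward = [xi + bi - di for xi, bi, di in zip(x, b, dx)]
        f_backward = f(backward)
        evals += 1
        if f_backward < fx:
            x, fx = backward, f_backward
            b = [bi - 0.4 * di for bi, di in zip(b, dx)]
            continue
        b = [0.5 * bi for bi in b]           # Step 5: shrink the bias
    return x, fx
```

Because a failed forward step triggers a second evaluation, the loop may use one evaluation more than `max_evals` before it stops.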

A detailed flow chart of the preceding steps is shown in Figure 7.9. Usually the 
initial bias is set to a zero vector. Each component of the random vector dx should 
be a random variable that has a zero mean and a variance proportional to the range 
of the corresponding parameter; this allows the method to apply the same degree 
of exploration for each dimension of the parameter space. A further enhancement 
is to make the variance of each element in dx decrease with time; this serves the 
same purpose as the cooling schedule in simulated annealing. The file randsrch.m, 
available via FTP or WWW (see page xxiii), is a simple implementation of the 
improved random search method with time-independent variances. 
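
The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not a port of randsrch.m: the function name, the Gaussian choice for dx, the common standard deviation `sigma`, and the default evaluation budget are all assumptions made for the example.

```python
import random


def modified_random_search(f, x0, sigma=0.5, max_evals=2000, seed=0):
    """Minimize f with the six-step modified random search (illustrative sketch)."""
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    evals = 1
    b = [0.0] * len(x)                      # Step 1: zero initial bias
    # Step 6: the budget is checked once per iteration, so the final
    # iteration may exceed max_evals by one evaluation.
    while evals < max_evals:
        # Zero-mean random vector; sigma is an assumed, time-independent spread.
        dx = [rng.gauss(0.0, sigma) for _ in x]
        # Steps 2-3: try the forward point x + b + dx.
        trial = [xi + bi + di for xi, bi, di in zip(x, b, dx)]
        ft = f(trial)
        evals += 1
        if ft < fx:
            x, fx = trial, ft
            b = [0.2 * bi + 0.4 * di for bi, di in zip(b, dx)]
            continue
        # Step 4: try the reverse point x + b - dx.
        trial = [xi + bi - di for xi, bi, di in zip(x, b, dx)]
        ft = f(trial)
        evals += 1
        if ft < fx:
            x, fx = trial, ft
            b = [bi - 0.4 * di for bi, di in zip(b, dx)]
            continue
        # Step 5: both directions failed, so shrink the bias.
        b = [0.5 * bi for bi in b]
    return x, fx


def sphere(v):
    """Hypothetical test objective: sum of squares, minimized at the origin."""
    return sum(t * t for t in v)


best_x, best_f = modified_random_search(sphere, [3.0, -2.0])
```

Because the current point is replaced only when the objective decreases, `best_f` can never exceed the value at the start point. Making `sigma` shrink over time would give the cooling-like variant mentioned above.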

Unlike genetic algorithms and simulated annealing, the random search method 
is primarily for continuous optimization problems. It is possible to come up with 
a random search method for discrete or combinatorial optimization problems, but 
then the preceding observations may no longer be true and we might lose the ad- 
vantages of the modified method. 

Figure 7.10 demonstrates how the random search method tried to find the min- 
imum of the “peaks” function from three different start points; these start points 
are surrounded by circles while the corresponding endpoints are denoted by crosses. 

Figure 7.10. Random search applied to the “peaks” function. (MATLAB file: 
go_rand.m) 

(For a more distinct illustration, we flipped the color map such that brighter regions 
represent valleys instead of peaks.) Despite its simplicity, the method performed 
well on this optimization problem. 

7.5 DOWNHILL SIMPLEX SEARCH 

Downhill simplex search [14] is a derivative-free method for multidimensional 
function optimization. As with other derivative-free approaches, this search method 
is not very efficient compared to derivative-based methods. However, the concept 
behind downhill simplex search is simple and it has an interesting geometrical in- 
terpretation. 

We consider the minimization of a function of n variables with no constraints. 
We start with an initial simplex, which is a collection of n + 1 points in 
n-dimensional space. The downhill simplex search repeatedly replaces the point 
having the highest function value in a simplex with another point. (Note that this 
method has little to do with the simplex method for linear programming, except 
that both of them make use of the geometrical concept of a simplex.) When com- 
bined with other operations, the simplex under consideration adapts itself to the 
local landscape, elongating down long inclined planes, changing direction on en- 
countering a valley at an angle, and contracting in the neighborhood of a minimum. 
These operations are described next. 

To start the downhill simplex search, we must initialize a simplex of n + 1 
points. For example, a simplex is a triangle in two-dimensional space and a tetra- 
hedron in three-dimensional space. Moreover, we would like the simplex to be 
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nondegenerate — that is, it encloses a finite inner n-dimensional volume. An easy 
way to set up a simplex is to start with an initial starting point $P_0$ and take the 
other n points as 

$$P_i = P_0 + \lambda_i e_i, \quad i = 1, \ldots, n, \tag{7.2}$$

where the $e_i$'s are unit vectors forming a basis of the n-dimensional space and 
$\lambda_i$ is a constant reflecting the guess of the characteristic length scale of the 
optimization problem in question. 

We write $y_i$ for the function value at $P_i$ and let 

$$l = \arg\min_i (y_i) \quad (l \text{ for “low”}), \qquad h = \arg\max_i (y_i) \quad (h \text{ for “high”}). \tag{7.3}$$

In other words, $l$ and $h$ are respectively the indices for the minimum and maximum 
of $y_i$. In symbols, 

$$y_l = \min_i (y_i), \qquad y_h = \max_i (y_i). \tag{7.4}$$

Let $\bar{P}$ be the average (centroid) of these n + 1 points. Each cycle of this method 
starts with a reflection point $P^*$ of $P_h$. Depending on the function value at $P^*$, we 
have four possible operations to change the current simplex to explore the landscape 
of the function efficiently in multidimensional space. These four operations are (1) 
reflection away from $P_h$; (2) reflection and expansion away from $P_h$; (3) contraction 
along one dimension connecting $P_h$ and $\bar{P}$; and (4) shrinkage toward $P_l$ along all 
dimensions. These four operations are shown in Figure 7.11 when $f$ is a two-input 
function. 

Before describing the full cycle of the simplex search, we need to define four 
intervals to be used in the search process: 


• Interval 1: $\{y \mid y \le y_l\}$ 

• Interval 2: $\{y \mid y_l < y \le \max_{i,\, i \ne h}\{y_i\}\}$ 

• Interval 3: $\{y \mid \max_{i,\, i \ne h}\{y_i\} < y \le y_h\}$ 

• Interval 4: $\{y \mid y_h < y\}$ 


These intervals are shown in Figure 7.12. 

A full cycle of the downhill simplex search involves the following four steps: 

Reflection: Define the reflection point $P^*$ and its value $y^*$ as 

$$P^* = \bar{P} + \alpha(\bar{P} - P_h), \qquad y^* = f(P^*), \tag{7.5}$$

where the reflection coefficient $\alpha$ is a positive constant. Thus $P^*$ is on the 
line joining $P_h$ and $\bar{P}$, on the far side of $\bar{P}$ from $P_h$. Depending on the value 
of $y^*$, we have the following actions: 

Figure 7.11. Outcomes for a cycle in the downhill simplex search after (a) 
reflection away from $P_h$; (b) reflection and expansion away from $P_h$; (c) contraction 
along one dimension connecting $P_h$ and $\bar{P}$; (d) shrinkage toward $P_l$ along all 
dimensions. 


Figure 7.12. Four intervals used in the downhill simplex search. (The y axis is 
divided at $y_l$, $\max_{i \ne h} y_i$, and $y_h$ into Intervals 1 through 4.) 


1. If $y^*$ is in interval 1, go to expansion. 

2. If $y^*$ is in interval 2, replace $P_h$ with $P^*$ and finish this cycle. 

3. If $y^*$ is in interval 3, replace $P_h$ with $P^*$ and go to contraction. 

4. If $y^*$ is in interval 4, go to contraction. 

Expansion: Define the expansion point $P^{**}$ and its value $y^{**}$ as 

$$P^{**} = \bar{P} + \gamma(P^* - \bar{P}), \qquad y^{**} = f(P^{**}), \tag{7.6}$$

where the expansion coefficient $\gamma$ is greater than unity. If $y^{**}$ is in interval 
1, replace $P_h$ with $P^{**}$ and finish this cycle. Otherwise, replace $P_h$ with the 
original reflection point $P^*$ and finish this cycle. 

Contraction: Define the contraction point $P^{**}$ and its value $y^{**}$ as 

$$P^{**} = \bar{P} + \beta(P_h - \bar{P}), \qquad y^{**} = f(P^{**}), \tag{7.7}$$

where the contraction coefficient $\beta$ lies between 0 and 1. If $y^{**}$ is in interval 1, 
2, or 3, replace $P_h$ with $P^{**}$ and finish this cycle. Otherwise, go to shrinkage. 

Shrinkage: Replace each $P_i$ with $(P_i + P_l)/2$. Finish this cycle. 

The foregoing cycle is repeated until a given stopping criterion is met. Common 
stopping criteria are described in Section 7.1. A complete flow chart of the foregoing 
steps is given in Figure 7.13. 

Before using this method, we still need to determine three constants $\alpha$, 
$\beta$, and $\gamma$, which are the coefficients for reflection, contraction, and expansion, 
respectively. Generally speaking, the optimal values for these coefficients are 
application dependent, and the best way to select their values is by doing some 
trial-and-error experiments. A good starting point is $(\alpha, \beta, \gamma) = (1, 0.5, 2)$; 
these values are suggested in Nelder and Mead’s original paper [14]. 
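
The full cycle (reflection, expansion, contraction, shrinkage) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the fmins implementation: the function name, the single length scale `scale` used for every axis in Equation (7.2), and the stopping rule on the spread of function values are choices made for the example. One further assumption is flagged in the code: the centroid is taken over all points except the highest one, the common Nelder-Mead convention, which keeps the centroid fixed when the highest point is replaced in the middle of a cycle.

```python
def downhill_simplex(f, p0, scale=1.0, alpha=1.0, beta=0.5, gamma=2.0,
                     max_cycles=500, tol=1e-10):
    """Minimize f by downhill simplex search (illustrative sketch)."""
    n = len(p0)
    # Initial simplex as in Equation (7.2): P_i = P_0 + scale * e_i.
    pts = [list(p0)]
    for i in range(n):
        p = list(p0)
        p[i] += scale
        pts.append(p)
    ys = [f(p) for p in pts]

    for _ in range(max_cycles):
        lo = min(range(n + 1), key=ys.__getitem__)   # index l ("low")
        hi = max(range(n + 1), key=ys.__getitem__)   # index h ("high")
        if ys[hi] - ys[lo] < tol:                    # assumed stopping criterion
            break
        second = max(ys[i] for i in range(n + 1) if i != hi)
        # ASSUMPTION: centroid of every point except P_h.
        cen = [sum(pts[i][k] for i in range(n + 1) if i != hi) / n
               for k in range(n)]
        # Reflection, Equation (7.5).
        refl = [c + alpha * (c - ph) for c, ph in zip(cen, pts[hi])]
        y_refl = f(refl)
        if y_refl <= ys[lo]:                         # interval 1: expansion (7.6)
            expd = [c + gamma * (r - c) for c, r in zip(cen, refl)]
            y_expd = f(expd)
            if y_expd <= ys[lo]:
                pts[hi], ys[hi] = expd, y_expd
            else:
                pts[hi], ys[hi] = refl, y_refl
            continue
        if y_refl <= second:                         # interval 2: accept reflection
            pts[hi], ys[hi] = refl, y_refl
            continue
        if y_refl <= ys[hi]:                         # interval 3: replace, then contract
            pts[hi], ys[hi] = refl, y_refl
        # Contraction, Equation (7.7), reached from intervals 3 and 4.
        con = [c + beta * (ph - c) for c, ph in zip(cen, pts[hi])]
        y_con = f(con)
        if y_con <= ys[hi]:
            pts[hi], ys[hi] = con, y_con
        else:                                        # shrinkage toward P_l
            pl = pts[lo]
            pts = [[(a + b) / 2.0 for a, b in zip(p, pl)] for p in pts]
            ys = [f(p) for p in pts]
    lo = min(range(n + 1), key=ys.__getitem__)
    return pts[lo], ys[lo]


def bowl(p):
    """Hypothetical test objective with its minimum at (1, -0.5)."""
    return (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 0.5) ** 2


best_p, best_y = downhill_simplex(bowl, [0.0, 0.0])
```

The default coefficients are the (1, 0.5, 2) values quoted above; passing other values is one way to run the trial-and-error experiments the text recommends.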
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Figure 7.14. Applying downhill simplex search to the “peaks” function. (MATLAB 
file: go_simp.m) 


Figure 7.14 demonstrates the transition of the current point of the downhill sim- 
plex search when minimizing the “peaks” function from three different start points; 
these start points are surrounded by circles while the corresponding endpoints are 
denoted by crosses. (Again, we flipped the color map such that brighter regions 
represent valleys instead of peaks.) 

One more thing to note is that this method is deterministic — starting from the 
same initial point, this method will always lead to the same final point for a given 
objective function. This implies that the method will potentially lead to a local 
minimum and stay there forever. To enable this method to get out of local minima, 
we have to make it stochastic. One way to do this is to make the coefficients $\alpha$, 
$\beta$, and $\gamma$ random variables within appropriate ranges, such that the method can 
explore a wider area of the input domain. Another simple way is to run the search 
procedure repeatedly from various initial points selected randomly. 

The downhill simplex search is implemented as the command fmins in MATLAB. 
To use this command, type help fmins at the MATLAB prompt to get more infor- 
mation. 


7.6 SUMMARY 

This chapter presents four of the most popular derivative-free optimization meth- 
ods: genetic algorithms, simulated annealing, random search, and downhill simplex 
search. These techniques rely on modern high-speed computers; they all require a 
significant amount of computation when compared with derivative-based approaches 
(Chapters 5 and 6). However, these derivative-free approaches are more flexible 
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in terms of incorporating intuitive guidelines and forming sophisticated objective 
functions. 


EXERCISES 


1. For a traveling salesperson problem in which the cost to travel from one city to 
another is equal to the Euclidean distance between them, a closed tour is not 
optimal whenever there is a cross in the tour. Explain why. 

2. At the beginning of tsp.m, we have to fill a distance matrix such that the 
element at row i and column j in the distance matrix is equal to the travel 
distance between city i and city j. In tsp.m, this is done by two nested for 
loops: 


    for i = 1:NumCity, 
        for j = 1:NumCity, 
            distance(i, j) = norm(loc(i, :) - loc(j, :)); 
        end 
    end 


where NumCity is the number of cities. Can you vectorize this code to get rid 
of the for loops? 

3. Can you speed up tsp.m? Explain what can be done to make it run faster. 
Use the ideas you come up with to modify tsp.m and see how much speedup 
you actually gain. 

4. Modify tsp.m such that whenever the path comes across a circle centered at 
[0.5, 0.5] with radius 0.3, an extra cost of 0.5 is incurred. (Think of the circle 
as a river and you have to pay a toll to cross it.) 

5. Repeat Exercise 4, but with an extra cost of -0.3 (that is, a gain of 0.3) 
incurred instead. (Think of the circle as a national boundary that you are 
smuggling something across whenever you pass it.) 

6. Modify the program tsp.m to use the translation move set and compare the 
performance with the original program using the default 30-city problem. Note 
that SA is a stochastic optimization procedure, so the comparison should be 
based on a number of simulation runs. Try to do 10 runs on both programs 
and compare their average running time and objective function values. 

7. Repeat Exercise 6, but use the switching move set. 




8. Use the command fmins to minimize Rosenbrock’s parabolic valley [19] (also 
known as the banana function): 

$$f(x_1, x_2) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2.$$

The starting point is (—1.2, 1). 

9. Repeat Exercise 8 using the random search method. Compare the average 
running time and objective function value over 10 runs with those of fmins. 

10. Use the command fmins to minimize Powell’s quartic function [17]: 

$$f(x_1, x_2, x_3, x_4) = (x_1 + 10x_2)^2 + 5(x_3 - x_4)^2 + (x_2 - 2x_3)^4 + 10(x_1 - x_4)^4.$$

The starting point is (3, -1, 0, 1). 

11. Repeat Exercise 10 using the random search method. Compare the average 
running time and objective function value over 10 runs with those of fmins. 

12. Rewrite the random search method randsrch.m to include the feature of de- 
creasing variances of dx with respect to time. Run go_rand.m again and 
compare the result to Figure 7.10. 

13. Repeat Exercise 9 using the modified randsrch.m from Exercise 12. 

14. Repeat Exercise 11 using the modified randsrch.m from Exercise 12. 
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Adaptive Networks 
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8.1 INTRODUCTION 

This chapter describes the architectures and learning procedures of adaptive net- 
works, a unifying framework that subsumes almost all kinds of neural network 
paradigms with supervised learning capabilities. The fundamentals of adaptive 
networks are key to understanding the various other neural network 
paradigms (such as multilayer perceptrons and radial basis function networks) 
introduced in the subsequent chapters. 

An adaptive network, as the name indicates, is a network structure consisting 
of a number of nodes connected through directional links. Each node represents a 
process unit, and the links between nodes specify the causal relationship between 
the connected nodes. All or part of the nodes are adaptive, which means the 
outputs of these nodes depend on modifiable parameters pertaining to these nodes. 
The learning rule specifies how these parameters should be updated to minimize 
a prescribed error measure, which is a mathematical expression that measures the 
discrepancy between the network’s actual output and a desired output. In other 
words, an adaptive network is used for system identification (see Chapter 5), and 
our task is to find an appropriate network architecture and a set of parameters which 
can best model an unknown target system that is described by a set of input-output 
data pairs. 

The basic learning rule of the adaptive network is the well-known steepest de- 
scent method, in which the gradient vector is derived by successive invocations 
of the chain rule. This method for systematic calculation of the gradient vector 
was proposed independently several times, by Bryson and Ho [1], Werbos [16], and 
Parker [9]. However, because research on artificial neural networks was still in its 
infancy at those times, these researchers’ early work failed to receive the attention it 
deserved. In 1986, Rumelhart et al. [11] used the same procedure to find the gradi- 
ent in a multilayer neural network. Their procedure was called the backpropagation 
learning rule, a name which is now widely known because the work of Rumelhart et 
al. inspired enormous interest in research on neural networks. In this chapter, we 
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Figure 8.1. A feedforward adaptive network in layered representation. 


introduce Werbos’s original backpropagation method for finding gradient vectors 
and also present an improved version [3, 4] which speeds up the time-consuming 
learning process by incorporating the least-squares method. 


8.2 ARCHITECTURE 

As the name implies, an adaptive network (Figure 8.1) is a network structure 
whose overall input-output behavior is determined by a collection of modifiable 
parameters. Specifically, the configuration of an adaptive network is composed of a 
set of nodes connected by directed links, where each node performs a static node 
function on its incoming signals to generate a single node output and each link 
specifies the direction of signal flow from one node to another. Usually a node 
function is a parameterized function with modifiable parameters; by changing these 
parameters, we change the node function as well as the overall behavior of the 
adaptive network. 

In the following discussion, we shall assume that each node in an adaptive net- 
work performs a static mapping from its input(s) to output. Namely, a node’s 
output depends on its current inputs only; there are no dynamics or internal states 
in each node. Moreover, to facilitate the development of learning algorithms, we 
assume that all node functions are differentiable except at a finite number of points. 
In the most general case, an adaptive network is heterogeneous and each node may 
have a specific node function different from the others. Links in an adaptive net- 
work are merely used to specify the propagation direction of node outputs; generally 
there are no weights or parameters associated with links. Figure 8.1 is a typical 
adaptive network with two inputs and two outputs. 

The parameters of an adaptive network are distributed into its nodes, so each 
node has a local parameter set. The union of these local parameter sets is the 
network’s overall parameter set. If a node’s parameter set is not empty, then its 
node function depends on the parameter values; we use a square to represent this 




Figure 8.2. Decomposition of adaptive nodes: (a) a single node; (b) parameter 
sharing problem. 


kind of adaptive node. On the other hand, if a node has an empty parameter set, 
then its function is fixed; we use a circle to denote this type of fixed node. Each 
adaptive node can be decomposed into a fixed node plus one or several parameter 
nodes, as illustrated in the following example. 

Example 8.1 Parameter sharing in adaptive networks 

Figure 8.2(a) shows an adaptive network with only one node, which can be 
represented as y = f(x, a), where x and y are the input and output, respectively, and a is 
the parameter of the node. An equivalent representation is to move the parameter 
out of the node and put it into a parameter node, as shown in Figure 8.2(a). 
It is obvious that a parameter node is a special case of an adaptive node in which 
there are no inputs and the output is the parameter itself. The parameter node is 
useful in solving certain representation problems, such as the parameter sharing 
problem in Figure 8.2(b), where two adaptive nodes u = g(x,a) and v = h(y,a) 
share the same parameter a, as denoted by the dotted line linking these two nodes. 
By taking out the parameter and putting it into a parameter node, we can embed 
the parameter sharing requirement into the architecture. This simplifies network 
representation as well as software implementation. 


□ 

Adaptive networks are generally classified into two categories on the basis of the 
type of connections they have: feedforward and recurrent. The adaptive network 
shown in Figure 8.1 is feedforward, since the output of each node always propagates 
from the input side (left) to the output side (right). If there is a feedback link 
that forms a circular path in a network, then the network is recurrent; Figure 8.3 
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Figure 8.3. A recurrent adaptive network. 



Figure 8.4. A feedforward adaptive network in topological ordering representation. 


is an example. (From the viewpoint of graph theory, a feedforward network is 
represented by an acyclic directed graph which contains no directed cycles, while a 
recurrent network always contains at least one directed cycle.) 

In the layered representation of the feedforward adaptive network in 
Figure 8.1, there are no links between nodes in the same layer, and outputs of nodes 
in a specific layer are always connected to nodes in succeeding layers. This repre- 
sentation is usually preferred because of its modularity, in that nodes in the same 
layer have the same functionality or generate the same level of abstraction about 
input vectors. 

Another representation of feedforward networks is the topological ordering 
representation, which labels the nodes in an ordered sequence 1,2,3,..., such 
that there are no links from node i to node j whenever i > j. Figure 8.4 is the 
topological ordering representation of the network in Figure 8.1. This representation 
is less modular than the layered representation, but it facilitates the formulation of 
learning rules, as will be detailed in the next section. (Note that the topological 
ordering representation is in fact a special case of the layered representation, with 
one node per layer.) 

Conceptually, a feedforward adaptive network is actually a static mapping be- 
tween its input and output spaces; this mapping may be either a simple linear 
relationship or a highly nonlinear one, depending on the network structure (node 
arrangement and connections, and so on) and the functionality for each node. Here 
our aim is to construct a network for achieving a desired nonlinear mapping that 
is regulated by a data set consisting of desired input-output pairs of a target sys- 
tem to be modeled. This data set is usually called the training data set, and 
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Figure 8.5. A linear single-node adaptive network. 


the procedures we follow in adjusting the parameters to improve the network’s per- 
formance are often referred to as the learning rules or adaptation algorithms. 
Usually a network’s performance is measured as the discrepancy between the desired 
output and the network’s output under the same input conditions. This discrep- 
ancy is called the error measure and it can assume different forms for different 
applications. Generally speaking, a learning rule is derived by applying a specific 
optimization technique to a given error measure. 

Before introducing a basic learning algorithm for adaptive networks, we shall 
present several examples of adaptive networks. 

Example 8.2 An adaptive network with a single linear node 

Figure 8.5 is an adaptive network with a single node specified by 


$$x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3,$$

where $x_1$ and $x_2$ are inputs and $a_1$, $a_2$, and $a_3$ are modifiable parameters. The 
function defines a plane in $x_1$-$x_2$-$x_3$ space, and by setting appropriate values for 
the parameters, we can place this plane arbitrarily. By adopting the squared error 
as the error measure for this network, we can identify the optimal parameters via 
the linear least-squares estimation method introduced in Chapter 5. 


□ 
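
As a concrete illustration of Example 8.2, the following Python sketch identifies the three parameters of the single linear node by least squares, solving the normal equations with a small Gaussian elimination. The function name and the sample data are hypothetical, and the sketch assumes the samples are not all collinear in the input plane (otherwise the normal equations are singular).

```python
def fit_plane(samples):
    """Least-squares fit of x3 = a1*x1 + a2*x2 + a3 to (x1, x2, x3) samples."""
    # Accumulate the normal equations (A^T A) a = A^T y, with rows A = [x1, x2, 1].
    ata = [[0.0] * 3 for _ in range(3)]
    aty = [0.0] * 3
    for x1, x2, x3 in samples:
        row = (x1, x2, 1.0)
        for r in range(3):
            aty[r] += row[r] * x3
            for c in range(3):
                ata[r][c] += row[r] * row[c]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    m = [ata[r] + [aty[r]] for r in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    a = [0.0] * 3
    for r in range(2, -1, -1):
        a[r] = (m[r][3] - sum(m[r][c] * a[c] for c in range(r + 1, 3))) / m[r][r]
    return tuple(a)


# Hypothetical noise-free samples taken from the plane x3 = 2*x1 - x2 + 0.5.
data = [(0.0, 0.0, 0.5), (1.0, 0.0, 2.5), (0.0, 1.0, -0.5),
        (1.0, 1.0, 1.5), (2.0, 1.0, 3.5)]
a1, a2, a3 = fit_plane(data)
```

With noise-free samples the fit recovers the generating plane up to rounding; with noisy samples it returns the parameters that minimize the summed squared error.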


Example 8.3 Perceptron network 

If we add another node to restrict the output of the adaptive network in Figure 8.5 
to only two values, 0 and 1, then the nonlinear network shown in Figure 8.6 is obtained. 
Specifically, the node outputs are expressed as 


$$x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3,$$

and 

$$x_4 = f_4(x_3) = \begin{cases} 1 & \text{if } x_3 \ge 0, \\ 0 & \text{if } x_3 < 0, \end{cases}$$

where $f_3$ is a linearly parameterized function and $f_4$ is a step function which maps 
$x_3$ to either 0 or 1. The overall function of this network can be viewed as a linear 
classifier: the first node forms a decision boundary as a straight line in $x_1$-$x_2$ 
space, and the second node indicates which half plane the input vector $(x_1, x_2)$ 
resides in. Obviously, we can form an equivalent network with a single node whose 
function is the composition of $f_3$ and $f_4$; the resulting node is the building block of 
the classical perceptron [8, 10]. 

Since the step function is discontinuous at one point and flat at all the other 
points, it is not suitable for derivative-based learning procedures. One way to get 
around this difficulty is to use the sigmoidal function as a squashing function 
that has values between 0 and 1: 

$$x_4 = f_4(x_3) = \frac{1}{1 + e^{-x_3}}.$$

This is a continuous and differentiable approximation to the step function. The 
composition of $f_3$ and this differentiable $f_4$ is the building block for the multilayer 
perceptron in the following example. 


□ 
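
The two-node network of Example 8.3 is small enough to write out directly. The Python sketch below is only an illustration: the function names and the parameter values (which place the decision boundary at x1 + x2 = 1) are hypothetical. It shows the linear node, the step node, and the sigmoidal replacement side by side.

```python
import math


def linear_node(x1, x2, a1, a2, a3):
    """First node: x3 = a1*x1 + a2*x2 + a3."""
    return a1 * x1 + a2 * x2 + a3


def step_node(x3):
    """Second node as a step function: 1 on or above the boundary, else 0."""
    return 1 if x3 >= 0 else 0


def sigmoid_node(x3):
    """Differentiable squashing function with values between 0 and 1."""
    # Note: math.exp overflows for very large negative x3; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-x3))


params = (1.0, 1.0, -1.0)   # hypothetical parameters: boundary x1 + x2 = 1

inside = step_node(linear_node(2.0, 0.5, *params))    # point above the line
outside = step_node(linear_node(0.0, 0.0, *params))   # point below the line
soft = sigmoid_node(linear_node(0.0, 0.0, *params))   # smooth counterpart
```

The step node only reports which half plane the input lies in, whereas the sigmoid node also varies smoothly with the distance from the boundary, which is what makes derivative-based learning possible.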


Example 8.4 A multilayer perceptron 

Figure 8.7 is a typical architecture for a multilayer perceptron with three inputs, 
two outputs, and three hidden nodes that do not connect directly to either inputs 
or outputs. Each node in a network of this kind has the same node function, which 
is the composition of a linear $f_3$ and a sigmoidal $f_4$ in Example 8.3. For instance, 
the node function of node 7 in Figure 8.7 is 

$$x_7 = \frac{1}{1 + \exp[-(w_{4,7} x_4 + w_{5,7} x_5 + w_{6,7} x_6 + t_7)]},$$

where $x_4$, $x_5$, and $x_6$ are outputs from nodes 4, 5, and 6, respectively, and the 
parameter set of node 7 is denoted by $\{w_{4,7}, w_{5,7}, w_{6,7}, t_7\}$. Usually we view $w_{i,j}$ 
as the weight associated with the link connecting node i and j and $t_j$ as the 
threshold associated with node j. However, this weight-link association is only 
valid in this type of network. In general, a link only indicates the signal flow 
direction between connected nodes, as will be shown in other types of adaptive 
networks in the subsequent discussion. 

A more detailed discussion of the structure and learning rules of the multilayer 
perceptron is presented in Section 9.4. 
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Figure 8.7. A 3-3-2 neural network. 

□ 

8.3 BACKPROPAGATION FOR FEEDFORWARD NETWORKS 

This section introduces a basic learning rule for adaptive networks, which is in 
essence the simple steepest descent method discussed in Section 6.3 of Chapter 6. 
The central part of this learning rule concerns how to recursively obtain a gradient 
vector in which each element is defined as the derivative of an error measure with 
respect to a parameter. This is done by means of the chain rule, a basic formula for 
differentiating composite functions that is covered in every textbook on elementary 
calculus. The procedure of finding a gradient vector in a network structure is 
generally referred to as backpropagation because the gradient vector is calculated 
in the direction opposite to the flow of the output of each node. Once the gradient 
is obtained, a number of derivative-based optimization and regression techniques 
(Chapters 5 and 6) are available for updating the parameters. In particular, if we 
use the gradient vector in a simple steepest descent method, the resulting learning 
paradigm is often referred to as the backpropagation learning rule. We shall 
introduce this learning rule in the rest of this section. 

Suppose that a given feedforward adaptive network in the layered representation 
has $L$ layers and layer $l$ ($l = 0, 1, \ldots, L$; $l = 0$ represents the input layer) has $N(l)$ 
nodes. Then the output and function of node $i$ [$i = 1, \ldots, N(l)$] in layer $l$ can be 
represented as $x_{l,i}$ and $f_{l,i}$, respectively, as shown in Figure 8.8(a). Without loss 
of generality, we assume that there are no jumping links (that is, links connecting 
nonconsecutive layers). Since the output of a node depends on the incoming signals 
and the parameter set of the node, we have the following general expression for the 
node function $f_{l,i}$: 

$$x_{l,i} = f_{l,i}(x_{l-1,1}, \ldots, x_{l-1,N(l-1)}, \alpha, \beta, \gamma, \ldots), \tag{8.1}$$

where $\alpha$, $\beta$, $\gamma$, etc. are the parameters of this node. 

Figure 8.8. Our notational conventions: (a) layered representation; (b) topological 
ordering representation. 


Assuming that the given training data set has $P$ entries, we can define an error 
measure for the $p$th ($1 \le p \le P$) entry of the training data as the sum of squared 
errors: 

$$E_p = \sum_{k=1}^{N(L)} (d_k - x_{L,k})^2, \tag{8.2}$$

where $d_k$ is the $k$th component of the $p$th desired output vector and $x_{L,k}$ is the $k$th 
component of the actual output vector produced by presenting the $p$th input vector 
to the network. (For notational simplicity, we omit the subscript $p$ for both $d_k$ and 
$x_{L,k}$.) Obviously, when $E_p$ is equal to zero, the network is able to reproduce exactly 
the desired output vector in the $p$th training data pair. Thus our task here is to 
minimize an overall error measure, which is defined as $E = \sum_{p=1}^{P} E_p$. 

Remember that the definition of $E_p$ in Equation (8.2) is not universal; other 
definitions of $E_p$ are possible for specific situations or applications. Therefore, we 
shall avoid using an explicit expression for the error measure $E_p$ to emphasize the 
generality. In addition, we assume that $E_p$ depends on the output nodes only; more 
general situations will be discussed later. 

To use steepest descent to minimize the error measure, first we have to obtain 
the gradient vector. Before calculating the gradient vector, we should observe the 
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following causal relationships: 


change in parameter $\alpha$ => change in outputs of nodes containing $\alpha$ => change in network’s outputs => change in error measure 


where the arrows => indicate causal relationships. In other words, a small change 
in a parameter $\alpha$ will affect the output of the node containing $\alpha$; this in turn will 
affect the output of the final layer and thus the error measure. Therefore, the basic 
concept in calculating the gradient vector is to pass a form of derivative information 
starting from the output layer and going backward layer by layer until the input 
layer is reached. 

To facilitate the discussion, we define the error signal $\epsilon_{l,i}$ as the derivative of 
the error measure $E_p$ with respect to the output of node $i$ in layer $l$, taking both 
direct and indirect paths into consideration. In symbols, 

$$\epsilon_{l,i} = \frac{\partial^+ E_p}{\partial x_{l,i}}. \tag{8.3}$$

This expression was called the ordered derivative by Werbos [16]. The difference 
between the ordered derivative and the ordinary partial derivative lies in the way 
we view the function to be differentiated. For an internal node output $x_{l,i}$ (where 
$l \ne L$), the partial derivative $\partial E_p / \partial x_{l,i}$ is equal to zero, since $E_p$ does not depend on 
$x_{l,i}$ directly. However, it is obvious that $E_p$ does depend on $x_{l,i}$ indirectly, since 
a change in $x_{l,i}$ will propagate through indirect paths to the output layer and 
thus produce a corresponding change in the value of $E_p$. Therefore, $\epsilon_{l,i}$ can be 
viewed as the ratio of these two changes when they are made infinitesimal. The 
following example demonstrates the difference between the ordered derivative and 
the ordinary partial derivative. 


Example 8.5 Ordered derivatives and ordinary partial derivatives 


Consider a simple adaptive network shown in Figure 8.9, where $z$ is a function of $x$ 
and $y$, and $y$ is in turn a function of $x$: 

$$\begin{cases} z = g(x, y), \\ y = f(x). \end{cases}$$

For the ordinary partial derivative $\partial z / \partial x$, we assume that all the other input 
variables (in this case, $y$) are constant: 

$$\frac{\partial z}{\partial x} = \frac{\partial g(x, y)}{\partial x}.$$

In other words, we assume that the inputs x and y to the function g are independent, 
without paying attention to the fact that y is actually a function of x. For the 


Figure 8.9. Ordered derivatives and ordinary partial derivatives (Example 8.5). 


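ordered derivative, we take this indirect causal relationship into consideration: 

$$\frac{\partial^+ z}{\partial x} = \frac{\partial g(x, f(x))}{\partial x} = \left.\frac{\partial g(x, y)}{\partial x}\right|_{y=f(x)} + \left.\frac{\partial g(x, y)}{\partial y}\right|_{y=f(x)} \cdot \frac{\partial f(x)}{\partial x}.$$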

Therefore, the ordered derivative takes into consideration both the direct and indi- 
rect paths that lead to the causal relationship. 

□ 
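
The distinction in Example 8.5 can be checked numerically. The Python sketch below picks two hypothetical functions, f(x) = x^2 and g(x, y) = xy (so that z = x^3 overall), and estimates both derivatives by central differences; these particular functions and the step h are assumptions made only for illustration.

```python
def f(x):
    """y = f(x); a hypothetical choice for illustration."""
    return x * x


def g(x, y):
    """z = g(x, y); a hypothetical choice for illustration."""
    return x * y


def partial_dz_dx(x, h=1e-6):
    """Ordinary partial derivative: y is frozen while x is perturbed."""
    y = f(x)
    return (g(x + h, y) - g(x - h, y)) / (2.0 * h)


def ordered_dz_dx(x, h=1e-6):
    """Ordered derivative: y is recomputed from the perturbed x."""
    return (g(x + h, f(x + h)) - g(x - h, f(x - h))) / (2.0 * h)


partial_at_2 = partial_dz_dx(2.0)   # analytically y = x^2 = 4
ordered_at_2 = ordered_dz_dx(2.0)   # analytically y + x * 2x = 3x^2 = 12
```

The ordinary partial derivative sees only the direct path from x to z, while the ordered derivative adds the indirect contribution through y.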


The error signal for the $i$th output node (at layer $L$) can be calculated directly: 

$$\epsilon_{L,i} = \frac{\partial^+ E_p}{\partial x_{L,i}} = \frac{\partial E_p}{\partial x_{L,i}}. \tag{8.4}$$

This is equal to $\epsilon_{L,i} = -2(d_i - x_{L,i})$ if $E_p$ is defined as in Equation (8.2). For the 
internal node at the $i$th position of layer $l$, the error signal can be derived by the 
chain rule: 


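$$\underbrace{\epsilon_{l,i}}_{\text{error signal at layer } l} = \frac{\partial^+ E_p}{\partial x_{l,i}} = \sum_{m=1}^{N(l+1)} \frac{\partial^+ E_p}{\partial x_{l+1,m}} \frac{\partial f_{l+1,m}}{\partial x_{l,i}} = \sum_{m=1}^{N(l+1)} \underbrace{\epsilon_{l+1,m}}_{\text{error signal at layer } l+1} \frac{\partial f_{l+1,m}}{\partial x_{l,i}}, \tag{8.5}$$

where $0 \le l \le L - 1$. That is, the error signal of an internal node at layer $l$ can 
be expressed as a linear combination of the error signal of the nodes at layer $l + 1$. 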

Therefore, for any $l$ and $i$ [$0 \le l \le L$ and $1 \le i \le N(l)$], we can find $\epsilon_{l,i} = \partial^+ E_p / \partial x_{l,i}$ 
by first applying Equation (8.4) once to get error signals at the output layer, and 
then applying Equation (8.5) iteratively until we reach the desired layer l. The 
underlying procedure is called backpropagation since the error signals are obtained 
sequentially from the output layer back to the input layer. 

The gradient vector is defined as the derivative of the error measure with respect 
to each parameter, so we have to apply the chain rule again to find the gradient 
vector. If $\alpha$ is a parameter of the $i$th node at layer $l$, we have 

$$\frac{\partial^+ E_p}{\partial \alpha} = \frac{\partial^+ E_p}{\partial x_{l,i}} \frac{\partial f_{l,i}}{\partial \alpha} = \epsilon_{l,i} \frac{\partial f_{l,i}}{\partial \alpha}. \tag{8.6}$$

Note that if we allow the parameter $\alpha$ to be shared between different nodes, then 
Equation (8.6) should be changed to a more general form: 

$$\frac{\partial^+ E_p}{\partial \alpha} = \sum_{x^* \in S} \frac{\partial^+ E_p}{\partial x^*} \frac{\partial f^*}{\partial \alpha}, \tag{8.7}$$

where $S$ is the set of nodes containing $\alpha$ as a parameter; and $x^*$ and $f^*$ are the 
output and function, respectively, of a generic node in $S$. 
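
To make Equations (8.4) through (8.6) concrete, the Python sketch below runs them on a small layered network. It is an illustration under stated assumptions rather than a general adaptive-network implementation: every node is assumed to be a sigmoid of a weighted sum plus a threshold (as in Example 8.4), the error measure is the squared error of Equation (8.2), no parameter is shared, and the data layout, function names, and numeric values are all hypothetical.

```python
import math


def forward(weights, x0):
    """weights[l][m] = (w_list, t) for node m of layer l+1; returns all layer outputs."""
    xs = [list(x0)]
    for layer in weights:
        prev = xs[-1]
        xs.append([1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(ws, prev)) + t)))
                   for ws, t in layer])
    return xs


def error_measure(weights, x0, d):
    """Squared error E_p of Equation (8.2) for one training pair."""
    out = forward(weights, x0)[-1]
    return sum((dk - xk) ** 2 for dk, xk in zip(d, out))


def backprop(weights, x0, d):
    """Return (error signals per layer, gradients per node) for one training pair."""
    xs = forward(weights, x0)
    L = len(weights)
    eps = [None] * (L + 1)
    eps[L] = [-2.0 * (dk - xk) for dk, xk in zip(d, xs[L])]       # Equation (8.4)
    grads = [None] * L
    for l in range(L - 1, -1, -1):
        layer = weights[l]
        # Slope of each sigmoid node in layer l+1 with respect to its net input.
        slope = [x * (1.0 - x) for x in xs[l + 1]]
        # Equation (8.6): gradient = error signal * d(node function)/d(parameter).
        grads[l] = [([eps[l + 1][m] * slope[m] * xi for xi in xs[l]],
                     eps[l + 1][m] * slope[m]) for m in range(len(layer))]
        # Equation (8.5): error signals of layer l from those of layer l+1.
        eps[l] = [sum(eps[l + 1][m] * slope[m] * layer[m][0][i]
                      for m in range(len(layer))) for i in range(len(xs[l]))]
    return eps, grads


# Hypothetical 2-2-1 network and a single training pair.
weights = [[([0.5, -0.3], 0.1), ([0.8, 0.2], -0.4)],
           [([1.0, -1.5], 0.2)]]
x0, d = [0.6, -0.2], [1.0]
eps, grads = backprop(weights, x0, d)
```

Here `grads[l][m]` holds the derivatives of the error measure with respect to the weights and the threshold of node m in layer l+1; a finite-difference perturbation of any single parameter gives an independent check on these values.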

The derivative of the overall error measure E with respect to a is 


$$
\frac{\partial^+ E}{\partial a} = \sum_{p=1}^{P} \frac{\partial^+ E_p}{\partial a}. \tag{8.8}
$$


Accordingly, for simple steepest descent without line minimization, the update 
formula for the generic parameter a is 


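$$
\Delta a = -\eta \frac{\partial^+ E}{\partial a}, \tag{8.9}
$$

in which η is the learning rate, which can be further expressed as

$$
\eta = \frac{k}{\sqrt{\sum_a \left( \frac{\partial E}{\partial a} \right)^2}}, \tag{8.10}
$$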


where k is the step size, the length of each transition along the gradient direction 
in the parameter space. Usually we can change the step size to vary the speed of 
convergence; see Section 6.7.2 of Chapter 6. 
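A minimal sketch of this update follows, assuming that the learning rate is obtained by dividing the step size k by the length of the gradient vector, so that every update moves the parameters by exactly k in parameter space. The error measure and the function name are hypothetical.

```python
import math

def steepest_descent_step(params, grad, k):
    """One steepest descent update whose length in parameter space is k."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(params)           # stationary point: nothing to update
    eta = k / norm                    # learning rate implied by step size k
    return [a - eta * g for a, g in zip(params, grad)]

# Hypothetical error measure E(a1, a2) = a1**2 + 4 * a2**2
params = [3.0, 1.0]
grad = [2 * params[0], 8 * params[1]]
new_params = steepest_descent_step(params, grad, k=0.1)
step_length = math.dist(params, new_params)   # length of the transition
```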

When an n-node feedforward network is represented in its topological order, we
can envision the error measure $E_p$ as the output of an additional node with index
n + 1, whose node function $f_{n+1}$ can be defined on the outputs of any nodes with
smaller index; see Figure 8.8(b). (Therefore, $E_p$ may depend directly on any nodes.)
Applying the chain rule again, we have the following concise formula for calculating
the error signal $\epsilon_i = \partial^+ E_p / \partial x_i$:

$$
\frac{\partial^+ E_p}{\partial x_i} = \frac{\partial f_{n+1}}{\partial x_i} + \sum_{i < j \le n} \frac{\partial^+ E_p}{\partial x_j} \frac{\partial f_j}{\partial x_i}, \tag{8.11}
$$

or

$$
\epsilon_i = \frac{\partial f_{n+1}}{\partial x_i} + \sum_{i < j \le n} \epsilon_j \frac{\partial f_j}{\partial x_i}, \tag{8.12}
$$


where the first term shows the direct effect of $x_i$ on $E_p$ via the direct path from
node i to node n + 1 and each product term in the summation indicates the indirect
effect of $x_i$ on $E_p$. Once we find the error signal for each node, then the gradient
vector for the parameters is derived as before. 
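The recursion for a network in topological order can be sketched as follows. The dictionary `dfdx` of partial derivatives and the three-node network used to exercise it are hypothetical; the only thing taken from the text is the rule that each error signal is the direct term plus the weighted error signals of nodes with larger index.

```python
def error_signals(n, dfdx):
    """dfdx[(j, i)] holds the partial derivative of f_j with respect to x_i,
    evaluated at the current node outputs; a missing key means no direct link.
    Node n + 1 is the error node. Returns {i: eps_i}."""
    eps = {}
    for i in range(n, 0, -1):                     # reverse topological order
        direct = dfdx.get((n + 1, i), 0.0)        # direct path to the error node
        indirect = sum(eps[j] * dfdx.get((j, i), 0.0)
                       for j in range(i + 1, n + 1))
        eps[i] = direct + indirect
    return eps

# Hypothetical network: x2 = x1**2, x3 = x1 + x2, E = (x3 - 1)**2 + x2,
# evaluated at x1 = 1 (so x2 = 1 and x3 = 2).
dfdx = {
    (2, 1): 2.0,      # df2/dx1 = 2 * x1
    (3, 1): 1.0,      # df3/dx1
    (3, 2): 1.0,      # df3/dx2
    (4, 3): 2.0,      # dE/dx3 = 2 * (x3 - 1)
    (4, 2): 1.0,      # dE/dx2 (E depends directly on x2)
}
eps = error_signals(3, dfdx)
```

Here E as a function of x1 alone is (x1 + x1² − 1)² + x1², whose derivative at x1 = 1 is 8, which agrees with `eps[1]`.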
Another systematic way to calculate the error signals is through the representation
of the error-propagation network (or sensitivity model), which is obtained
from the original adaptive network by reversing the links and supplying
the error signals at the output layer as inputs to the new network. The following 
example illustrates the idea. 

Example 8.6 Adaptive network and its error-propagation model 


Figure 8.10(a) is an adaptive network, where each node is indexed by a unique
number. Again, we use $f_i$ and $x_i$ to denote the function and output of node i.
To calculate the error signals at internal nodes, an error-propagation network is
constructed in Figure 8.10(b), where the output of node i is the error signal of this
node in the original adaptive network. In symbols, if we choose the squared error
measure for $E_p$, then we have the following:


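$$
\begin{aligned}
\epsilon_9 &= \frac{\partial^+ E_p}{\partial x_9} = \frac{\partial E_p}{\partial x_9} = -2(d_9 - x_9), \\
\epsilon_8 &= \frac{\partial^+ E_p}{\partial x_8} = \frac{\partial E_p}{\partial x_8} = -2(d_8 - x_8), \\
\epsilon_7 &= \frac{\partial^+ E_p}{\partial x_7} = \frac{\partial^+ E_p}{\partial x_8} \frac{\partial f_8}{\partial x_7} + \frac{\partial^+ E_p}{\partial x_9} \frac{\partial f_9}{\partial x_7} = \epsilon_8 \frac{\partial f_8}{\partial x_7} + \epsilon_9 \frac{\partial f_9}{\partial x_7}, \\
\epsilon_6 &= \frac{\partial^+ E_p}{\partial x_6} = \frac{\partial^+ E_p}{\partial x_8} \frac{\partial f_8}{\partial x_6} + \frac{\partial^+ E_p}{\partial x_9} \frac{\partial f_9}{\partial x_6} = \epsilon_8 \frac{\partial f_8}{\partial x_6} + \epsilon_9 \frac{\partial f_9}{\partial x_6}.
\end{aligned}
$$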

Thus nodes 9 and 8 in the error-propagation network are only buffer nodes. Similar 
expressions can be written for the error signals of nodes 1, 2, 3, 4, and 5. It is 
interesting to observe that in the error-propagation net, if we associate each link 
connecting nodes i and j (i < j) with a weight $w_{ij} = \partial f_j / \partial x_i$, then each node performs
a linear function and the error-propagation net is actually a linear network. 


□ 


There are two types of learning paradigms that are available to suit the needs 
for various applications. In off-line learning (or batch learning), the update 
formula for parameter a is based on Equation (8.8) and the update action takes 
place only after the whole training data set has been presented — that is, only after 
each epoch or sweep. On the other hand, in on-line learning (or pattern- 
by-pattern learning), the parameters are updated immediately after each input- 
output pair has been presented, and the update formula is based on Equation (8.6). 
In practice, it is possible to combine these two learning modes and update the 
parameter after k training data entries have been presented, where k is between 
1 and P and it is sometimes referred to as the epoch size. These two types of 
learning paradigms are described in greater detail in Section 8.5, where a hybrid 
learning rule is introduced. 
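The three update schedules described above differ only in how many training pairs are presented between parameter updates. The following sketch makes that explicit for a hypothetical one-parameter model o = a·x with squared error; the model, the data, the learning rate, and the function names are assumptions chosen for illustration.

```python
def grad_p(a, x, d):
    # dE_p/da for E_p = (d - a * x)**2 under the hypothetical model o = a * x
    return -2.0 * (d - a * x) * x

def train(data, a=0.0, eta=0.05, epochs=200, k=None):
    """k is the number of pairs presented between updates:
    k = 1 is on-line learning, k = len(data) is off-line (batch) learning."""
    k = len(data) if k is None else k
    for _ in range(epochs):
        acc, seen = 0.0, 0
        for x, d in data:
            acc += grad_p(a, x, d)    # accumulate the per-pattern gradients
            seen += 1
            if seen == k:
                a -= eta * acc        # update after k pairs
                acc, seen = 0.0, 0
        if seen:                      # leftover pairs at the end of the epoch
            a -= eta * acc
    return a

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # generated by d = 2 * x
a_batch = train(data)                 # update once per epoch
a_online = train(data, k=1)           # update after every pair
a_mixed = train(data, k=2)            # update after every two pairs
```

With noise-free data such as this, all three schedules settle on the same parameter value; they differ in how often the parameter is changed along the way.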


8.4 EXTENDED BACKPROPAGATION FOR RECURRENT NETWORKS


Figure 8.11. A simple recurrent network.

For recurrent adaptive networks, it is possible to derive an extended version of the
backpropagation procedure that finds gradient vectors. To simplify our notation,
we shall use the network in Figure 8.11 for most of our discussion, where $x_1$ and $x_2$
are inputs and $x_5$ and $x_6$ are output nodes. Because it has directional loops 3-4-5,
3-4-6-5, and 6 (a self-loop), this is a typical recurrent network with node functions
denoted as follows:


$$
\begin{aligned}
x_3 &= f_3(x_1, x_5), \\
x_4 &= f_4(x_2, x_3), \\
x_5 &= f_5(x_4, x_6), \\
x_6 &= f_6(x_4, x_6).
\end{aligned} \tag{8.13}
$$


To derive correctly the backpropagation procedure for the recurrent net in Figure 8.11,
we have to distinguish two operating modes through which the network
may satisfy Equation (8.13). These two modes are synchronous operation and
continuous operation; the backpropagation procedures corresponding to these
two operating modes are described next.


8.4.1 Synchronously Operated Networks: BPTT and RTRL 

If a network is operated synchronously, all nodes change their outputs simultane- 
ously according to a global clock signal and there is a time delay associated with 
each link. This synchronization is reflected by adding the time t as an argument 
to the output of each node in Equation (8.13) (assuming there is a unit time delay 
associated with each link): 


$$
\begin{aligned}
x_3(t+1) &= f_3(x_1(t), x_5(t)), \\
x_4(t+1) &= f_4(x_2(t), x_3(t)), \\
x_5(t+1) &= f_5(x_4(t), x_6(t)), \\
x_6(t+1) &= f_6(x_4(t), x_6(t)).
\end{aligned} \tag{8.14}
$$


Backpropagation Through Time (BPTT) 

When using synchronously operated networks, we usually are interested in identi- 
fying a set of parameters that will make the output of a node (or several nodes) 
follow a given trajectory (or trajectories) in the discrete time domain. This problem 
of tracking or trajectory following is usually solved by using a method called 
unfolding of time to transform a recurrent network into a feedforward one, as 
long as the time t does not exceed a reasonable maximum T. This idea was originally
introduced by Minsky and Papert [7] and combined with backpropagation by
Rumelhart et al. [11]. Consider the recurrent net in Figure 8.11, which is redrawn
in Figure 8.12(a) with the same configuration except that the input variables $x_1$
and $x_2$ are omitted for simplicity. The same network in a feedforward architecture
is shown in Figure 8.12(b), with the time index t running from 1 to 4. In other 
words, for a recurrent net that synchronously evaluates each of its node functions at 
t= 1, 2, . . . , T, we can simply duplicate all units T times and arrange the resulting 
network in a layered feedforward manner. 

It is obvious that the two networks in Figures 8.12(a) and 8.12(b) behave iden- 
tically for t = 1 to T, provided that all copies of parameters across different time 
steps remain identical. For instance, the parameter in node 3 of Figure 8.12(a) 
must be the same at all time instants. This is the parameter sharing problem ad- 
dressed in Example 8.1. A quick and dirty solution to this problem is to move 
the parameters from nodes 3 and 6 into the so-called parameter nodes, which are 
independent of the time step, as shown in Figure 8.13. (Without loss of generality, 
we assume that nodes 3 and 6 both have only one parameter, denoted by a and b,
respectively.) After setting up the parameter nodes in this way, we can apply the 
backpropagation procedure as usual to the network (which is still feedforward in 
nature) in Figure 8.13 without the slightest concern about the parameter sharing 
constraint. Note that the error signals of a parameter node come from nodes located
at layers across different time instants; thus the backpropagation procedure (and
the corresponding steepest descent) for this kind of unfolded network is often called
backpropagation through time (BPTT).

Figure 8.12. (a) A synchronously operated recurrent network and (b) its feedforward
equivalent obtained via unfolding of time.
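As a rough sketch of BPTT, consider a hypothetical single-node recurrent network x_i = tanh(a·x_{i−1} + u_i) with one shared parameter a, external inputs u_i, and error measure E = Σ (d_i − x_i)². The node function, the inputs, and all names below are assumptions made for illustration. The forward pass stores every copy of the node output; the backward pass propagates error signals through the unfolded copies and accumulates their contributions at the shared parameter.

```python
import math

def unroll(a, x0, inputs):
    """Forward pass through the unfolded network; returns [x_0, ..., x_T]."""
    xs = [x0]
    for u in inputs:
        xs.append(math.tanh(a * xs[-1] + u))
    return xs

def bptt_gradient(a, x0, inputs, desired):
    """Ordered derivative of E = sum_i (d_i - x_i)**2 with respect to a,
    where a is shared by the copies of the node at all time steps."""
    xs = unroll(a, x0, inputs)
    T = len(inputs)
    grad, eps_next = 0.0, 0.0
    for i in range(T, 0, -1):
        # error signal of the copy at time i: direct effect on E ...
        eps = -2.0 * (desired[i - 1] - xs[i])
        if i < T:
            # ... plus the effect through the copy at time i + 1
            eps += eps_next * (1 - xs[i + 1] ** 2) * a
        # contribution of this copy to the shared parameter
        grad += eps * (1 - xs[i] ** 2) * xs[i - 1]
        eps_next = eps
    return grad

inputs = [0.5, -0.3, 0.1, 0.4]
desired = [0.3, 0.1, -0.2, 0.5]
grad = bptt_gradient(0.7, 0.2, inputs, desired)
```

The list `xs` grows with the sequence length, which is the memory cost of unfolding discussed in the text.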


Real-Time Recurrent Learning (RTRL) 

BPTT generally works well for most problems; the only complication is that it 
requires extensive computing resources when the sequence length T is large, because 
the duplication of nodes makes both memory requirements and simulation time 
proportional to T. Therefore, for long sequences or sequences of unknown length, 
real-time recurrent learning (RTRL) [18] is employed instead to perform on-line 
learning — that is, to update parameters while the network is running rather than 
at the end of the presented sequences. 

To explain the rationale behind RTRL, we take as an example the simple recur- 
rent network in Figure 8.14(a), where there is only one node with one parameter 
a. After moving the parameter out of the unfolded architecture, we obtain the 
feedforward network shown in Figure 8.14(b). Figure 8.14(c) is the corresponding 
error-propagation network. Here we assume $E = \sum_{i=1}^{T} E_i = \sum_{i=1}^{T} (d_i - x_i)^2$, where
i is the index for time and $d_i$ and $x_i$ are the desired and the actual node output,
respectively, at time instant i.

To save computation and memory requirements, a sensible choice is to minimize
$E_i$ at each time step instead of trying to minimize E at the end of a sequence.
To achieve this, we need to calculate $\partial^+ E_i / \partial a$ recursively at each time step i. For
i = 1, the error-propagation network is as shown in Figure 8.15(a) and we have

$$
\frac{\partial^+ x_1}{\partial a} = \frac{\partial x_1}{\partial a}
\quad \text{and} \quad
\frac{\partial^+ E_1}{\partial a} = \frac{\partial E_1}{\partial x_1} \frac{\partial^+ x_1}{\partial a}. \tag{8.15}
$$

Figure 8.13. An alternative representation of Figure 8.12(b) that satisfies the
parameter sharing requirement.

Figure 8.14. A simple recurrent adaptive network to illustrate RTRL: (a) a recurrent
net with single node and single parameter; (b) unfolding-of-time architecture;
(c) error-propagation network.
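Figure 8.15. Error-propagation networks at different time steps: (a) i = 1; (b)
i = 2; (c) i = 3; (d) a general situation where the thick arrow represents $\partial^+ x_{i-1} / \partial a$.

For i = 2, the error-propagation network is as shown in Figure 8.15(b) and we have

$$
\frac{\partial^+ x_2}{\partial a} = \frac{\partial x_2}{\partial a} + \frac{\partial x_2}{\partial x_1} \frac{\partial^+ x_1}{\partial a}
\quad \text{and} \quad
\frac{\partial^+ E_2}{\partial a} = \frac{\partial E_2}{\partial x_2} \frac{\partial^+ x_2}{\partial a}. \tag{8.16}
$$

For i = 3, the error-propagation network is as shown in Figure 8.15(c) and we have

$$
\frac{\partial^+ x_3}{\partial a} = \frac{\partial x_3}{\partial a} + \frac{\partial x_3}{\partial x_2} \frac{\partial^+ x_2}{\partial a}
\quad \text{and} \quad
\frac{\partial^+ E_3}{\partial a} = \frac{\partial E_3}{\partial x_3} \frac{\partial^+ x_3}{\partial a}. \tag{8.17}
$$

In general, for the error-propagation at time instant i, we have

$$
\frac{\partial^+ x_i}{\partial a} = \frac{\partial x_i}{\partial a} + \frac{\partial x_i}{\partial x_{i-1}} \frac{\partial^+ x_{i-1}}{\partial a}
\quad \text{and} \quad
\frac{\partial^+ E_i}{\partial a} = \frac{\partial E_i}{\partial x_i} \frac{\partial^+ x_i}{\partial a}, \tag{8.18}
$$

where $\partial^+ x_{i-1} / \partial a$ is already available from the calculation at the previous time instant.
Figure 8.15(d) shows this general situation, where the thick arrow represents $\partial^+ x_{i-1} / \partial a$,
which is already available at the time instant i − 1.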

Therefore, by trying to minimize each individual $E_i$, we can recursively find
the gradient $\partial^+ E_i / \partial a$ at each time instant; there is no need to wait until the end of
the presented sequence. Since this is an approximation of the original BPTT, the
learning rate η in the steepest descent update formula

$$
\Delta a = -\eta \frac{\partial^+ E_i}{\partial a}
$$

should be kept small and, as a result, the learning process usually takes longer.
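The recursion can be sketched for a hypothetical single-node network x_i = tanh(a·x_{i−1} + u_i), where the node function, the external inputs u_i, and the names below are illustrative assumptions. Only the previous output and the previous ordered derivative are carried forward, so memory use does not grow with the sequence length.

```python
import math

def rtrl(a, x0, inputs, desired, eta=0.0):
    """Run the net forward, computing d+E_i/da on the fly at every step and
    applying an on-line steepest descent update with learning rate eta."""
    x_prev, dx_prev = x0, 0.0         # d+x_0/da = 0: x_0 does not depend on a
    grads = []
    for u, d in zip(inputs, desired):
        x = math.tanh(a * x_prev + u)
        slope = 1 - x * x
        # direct effect of a on x_i plus the effect through x_{i-1}
        dx = slope * x_prev + slope * a * dx_prev
        g = -2.0 * (d - x) * dx       # d+E_i/da for E_i = (d_i - x_i)**2
        grads.append(g)
        a -= eta * g                  # update while the network is running
        x_prev, dx_prev = x, dx
    return grads, a

inputs = [0.5, -0.3, 0.1, 0.4]
desired = [0.3, 0.1, -0.2, 0.5]
grads_fixed, _ = rtrl(0.7, 0.2, inputs, desired)             # eta = 0
grads_online, a_new = rtrl(0.7, 0.2, inputs, desired, eta=0.05)
```

With `eta = 0` the parameter stays fixed and the per-step gradients add up to the gradient of the total error E. With a nonzero `eta` the parameter changes during the sequence, so the carried derivative is only approximate, which is one reason for keeping the learning rate small.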


8.4.2 Continuously Operated Networks: Mason's Gain Formula* 

In a network that is operated continuously, all nodes continuously change their outputs
until Equation (8.13) is satisfied. This operating mode is of particular interest
for analog circuit implementations, where a certain kind of dynamical evolution
rule is imposed on the network. For instance, the dynamical formula for node 3 in
Figure 8.11 can be written as


$$
\tau_3 \frac{dx_3}{dt} + x_3 = f_3(x_1, x_5). \tag{8.19}
$$

Similar formulas can be devised for other nodes. It is obvious that when $x_3(t)$
stops changing (i.e., $dx_3/dt = 0$), Equation (8.19) leads to the correct fixed points
satisfying Equation (8.13). We assume that at least one such stable fixed point
exists for every possible node output in Equation (8.13). Other situations, such 
as limit cycles, can be either generated or eliminated via the approximation of a 
synchronously operated network using the techniques introduced in Section 8.4.1. 
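In software, the continuous operating mode can be imitated by integrating the dynamical formulas until the outputs stop changing. The sketch below uses forward Euler integration with a common time constant for all four nodes of Figure 8.11. The node functions are hypothetical; their coefficients were chosen small enough that a single stable fixed point exists.

```python
import math

def node_functions(x1, x2, x):
    """Hypothetical f_3, ..., f_6 for Figure 8.11; x = [x3, x4, x5, x6]."""
    x3, x4, x5, x6 = x
    return [math.tanh(0.5 * x1 + 0.3 * x5),    # f_3(x1, x5)
            math.tanh(x2 - 0.4 * x3),          # f_4(x2, x3)
            math.tanh(0.6 * x4 + 0.2 * x6),    # f_5(x4, x6)
            math.tanh(0.3 * x4 - 0.5 * x6)]    # f_6(x4, x6)

def relax(x1, x2, tau=1.0, dt=0.1, steps=3000):
    """Euler integration of tau * dx_i/dt + x_i = f_i(...), starting at 0."""
    x = [0.0] * 4
    for _ in range(steps):
        f = node_functions(x1, x2, x)
        x = [xi + dt / tau * (fi - xi) for xi, fi in zip(x, f)]
    return x

x_fixed = relax(x1=0.7, x2=-0.2)
# at a fixed point every node output equals its node function value
residual = max(abs(xi - fi)
               for xi, fi in zip(x_fixed, node_functions(0.7, -0.2, x_fixed)))
```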

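By assuming that the error measure E is a function of the output nodes, that
is, $E = E(x_5, x_6)$, we obtain the following equations via repeated applications of
the chain rule:

$$
\begin{aligned}
\frac{\partial^+ E}{\partial x_3} &= \frac{\partial^+ E}{\partial x_4} \frac{\partial f_4}{\partial x_3}, \\
\frac{\partial^+ E}{\partial x_4} &= \frac{\partial^+ E}{\partial x_5} \frac{\partial f_5}{\partial x_4} + \frac{\partial^+ E}{\partial x_6} \frac{\partial f_6}{\partial x_4}, \\
\frac{\partial^+ E}{\partial x_5} &= \frac{\partial^+ E}{\partial x_3} \frac{\partial f_3}{\partial x_5} + \frac{\partial E}{\partial x_5}, \\
\frac{\partial^+ E}{\partial x_6} &= \frac{\partial^+ E}{\partial x_5} \frac{\partial f_5}{\partial x_6} + \frac{\partial^+ E}{\partial x_6} \frac{\partial f_6}{\partial x_6} + \frac{\partial E}{\partial x_6}.
\end{aligned} \tag{8.20}
$$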

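As before, if the error signal $\partial^+ E / \partial x_i$ is denoted as $\epsilon_i$, Equation (8.20) can be
simplified to the following:

$$
\begin{aligned}
\epsilon_3 &= \epsilon_4 w_{43}, \\
\epsilon_4 &= \epsilon_5 w_{54} + \epsilon_6 w_{64}, \\
\epsilon_5 &= \epsilon_3 w_{35} + \frac{\partial E}{\partial x_5}, \\
\epsilon_6 &= \epsilon_5 w_{56} + \epsilon_6 w_{66} + \frac{\partial E}{\partial x_6},
\end{aligned} \tag{8.21}
$$

or, equivalently,

$$
\begin{bmatrix}
1 & -w_{43} & 0 & 0 \\
0 & 1 & -w_{54} & -w_{64} \\
-w_{35} & 0 & 1 & 0 \\
0 & 0 & -w_{56} & 1 - w_{66}
\end{bmatrix}
\begin{bmatrix} \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ \partial E / \partial x_5 \\ \partial E / \partial x_6 \end{bmatrix}, \tag{8.22}
$$

where $w_{ij} = \partial f_i / \partial x_j$. Then $\epsilon_i$ can be obtained through the standard method for linear
algebraic equations. Once we have $\epsilon_i$, the gradient for a generic parameter a in
node i can be found directly:

$$
\frac{\partial^+ E}{\partial a} = \frac{\partial^+ E}{\partial x_i} \frac{\partial f_i}{\partial a} = \epsilon_i \frac{\partial f_i}{\partial a}. \tag{8.23}
$$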
Figure 8.16. A recurrent error-propagation network corresponding to Figure 8.11.


It is of interest to note that Equation (8.21) can be represented as the recurrent
network shown in Figure 8.16, where the output of node i is the error signal $\epsilon_i$. The
topology of this recurrent error-propagation network is the same as that of
the original network (Figure 8.11), except that the direction of each link is reversed
and the link from node i to j is associated with a weight $w_{ij}$, defined by $\partial f_i / \partial x_j$.
Moreover, two quantities $I_1 = \partial E / \partial x_5$ and $I_2 = \partial E / \partial x_6$ are provided as the inputs to this network.
As a result, if we have hardware-implemented networks, the calculation of $\epsilon_i$ would
be similar to finding the fixed points of the original network, which is topologically
equivalent to its error-propagation network.

Without hardware-implemented networks, solving Equation (8.22) would seem
to require lengthy calculation using the Gaussian elimination method or other similar
techniques. An alternative approach to obtaining $\epsilon_i$ is Mason’s gain formula [6],
which is commonly employed to find transfer functions of linear systems
represented in signal flow graphs or block diagrams. The signal flow graph [5] may
be regarded as a cause-and-effect representation of a linear system; our recurrent
error-propagation network undoubtedly is such a system. Therefore, by applying
the following gain formula, we can obtain $\epsilon_i$ by mere inspection.

Theorem 8.1 Mason’s gain formula [6] for the recurrent error-propagation network

The general gain formula between $\epsilon_i$ and an input quantity I is

$$
M = \frac{\epsilon_i}{I} = \frac{\sum_{k=1}^{N} M_k \Delta_k}{\Delta}, \tag{8.24}
$$

where

  $M$ = gain between $\epsilon_i$ and I
  $\epsilon_i$ = output of node i of the recurrent error-propagation network
  $I$ = input quantity
  $N$ = total number of forward paths from I to $\epsilon_i$
  $M_k$ = gain of the kth forward path
  $\Delta$ = $1 - \sum_m P_{m1} + \sum_m P_{m2} - \sum_m P_{m3} + \cdots$
  $P_{mr}$ = gain product of the mth possible combination of r nontouching
         loops, that is, loops sharing no common nodes
  $\Delta_k$ = the $\Delta$ for that part of the recurrent error-propagation network
         which is nontouching with the kth forward path

□

Figure 8.17. Loops in the recurrent error-propagation network of Figure 8.16.
(Note that loops 1 and 3 are nontouching.)

At first glance, calculating Mason’s gain formula may seem to be a formidable
task because of the complex expressions for $\Delta$ and $\Delta_k$. In practice, however, this
gain formula is straightforward, since recurrent error-propagation networks with a
large number of nontouching loops are rare. The following example illustrates how
to apply Mason’s gain formula for the error-propagation network in Figure 8.16.

Example 8.7 Mason’s gain formula

To express the error signal $\epsilon_3$ in terms of the inputs $I_1$ and $I_2$ in Figure 8.16, we
first observe that there are three loops in the recurrent error-propagation network,
as shown in Figure 8.17. The gains for these loops are

$$
\begin{aligned}
\text{loop 1 (5-4-3):} \quad & l_1 = w_{54} w_{43} w_{35}, \\
\text{loop 2 (5-6-4-3):} \quad & l_2 = w_{56} w_{64} w_{43} w_{35}, \\
\text{loop 3 (6):} \quad & l_3 = w_{66},
\end{aligned} \tag{8.25}
$$


where loop 1 and loop 3 are nontouching loops, since they do not share a common
node. Thus the $\Delta$ in Equation (8.24) [which is equal to the determinant of the
square matrix in Equation (8.22)] is expressed as

$$
\Delta = 1 - (l_1 + l_2 + l_3) + l_1 l_3. \tag{8.26}
$$


To find the gain between $\epsilon_3$ and $I_1$, note that there are two direct paths between
them: $I_1$-5-4-3 and $I_1$-5-6-4-3. For the first path, we have $M_1 = w_{54} w_{43}$ and
$\Delta_1 = 1 - w_{66}$, since only loop 3 is nontouching with respect to this path. For the
second path, we have $M_2 = w_{56} w_{64} w_{43}$ and $\Delta_2 = 1$, since no loop is nontouching
with respect to this path. Therefore, the gain $g_1$ between $\epsilon_3$ and $I_1$ is

$$
g_1 = \frac{M_1 \Delta_1 + M_2 \Delta_2}{\Delta}
= \frac{w_{54} w_{43} (1 - w_{66}) + w_{56} w_{64} w_{43}}{\Delta}. \tag{8.27}
$$


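In contrast, there is only one direct path ($I_2$-6-4-3) connecting $\epsilon_3$ and $I_2$. Consequently,
we have $M_1 = w_{64} w_{43}$ and $\Delta_1 = 1$, and the gain $g_2$ between $\epsilon_3$ and $I_2$
is

$$
g_2 = \frac{M_1 \Delta_1}{\Delta} = \frac{w_{64} w_{43}}{\Delta}. \tag{8.28}
$$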

Since the recurrent error-propagation network is a linear system, the principle of
superposition applies. Thus, by combining Equations (8.27) and (8.28), we obtain
the error signal $\epsilon_3$ as follows:

$$
\epsilon_3 = g_1 I_1 + g_2 I_2
= \frac{w_{54} w_{43} (1 - w_{66}) + w_{56} w_{64} w_{43}}{\Delta} \frac{\partial E}{\partial x_5}
+ \frac{w_{64} w_{43}}{\Delta} \frac{\partial E}{\partial x_6}. \tag{8.29}
$$

□ 
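The result of this example can be checked numerically by solving the linear system of Equation (8.22) directly and comparing the first unknown with the value given by Mason's gain formula. The weight values and inputs below are arbitrary numbers chosen for illustration, and `solve` is a small helper written for this sketch.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]     # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# Hypothetical values of w_ij = df_i/dx_j and of the inputs I1, I2
w43, w54, w64, w35, w56, w66 = 0.5, -0.4, 0.3, 0.8, 0.6, 0.2
I1, I2 = 1.5, -0.7

# Direct solution of the 4 x 4 system; unknowns ordered eps3, eps4, eps5, eps6
A = [[1.0, -w43, 0.0, 0.0],
     [0.0, 1.0, -w54, -w64],
     [-w35, 0.0, 1.0, 0.0],
     [0.0, 0.0, -w56, 1.0 - w66]]
eps3_direct = solve(A, [0.0, 0.0, I1, I2])[0]

# Mason's gain formula: loop gains, determinant, path gains, superposition
l1, l2, l3 = w54 * w43 * w35, w56 * w64 * w43 * w35, w66
delta = 1 - (l1 + l2 + l3) + l1 * l3
g1 = (w54 * w43 * (1 - w66) + w56 * w64 * w43) / delta
g2 = w64 * w43 / delta
eps3_mason = g1 * I1 + g2 * I2
```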


In essence, the function of the continuously operated recurrent network is still 
a static mapping when the property of dynamical evolution [see Equation (8.19)] is 
ignored. There is no theoretical proof or derivation comparing the approximation 
power of a continuously operated recurrent network with that of a typical ordi- 
nary feedforward network. Moreover, calculating the stable attractor defined by 
Equation (8.13) could be time-consuming in software simulations of the network. 
As a result, continuously operated recurrent networks are not used as often as the 
synchronously operated recurrent networks described in Section 8.4.1. 


8.5 HYBRID LEARNING RULE: COMBINING STEEPEST DESCENT AND LSE


Although we can apply backpropagation or steepest descent to identify the parameters
in an adaptive network, this simple optimization method usually takes a long
time before it converges. We may observe, however, that an adaptive network’s output
(assuming there is only one) is linear in some of the network’s parameters; thus
we can identify these linear parameters by the linear least-squares method described
in Chapter 5. This approach leads to a hybrid learning rule [3, 4] which combines
steepest descent (SD) and the least-squares estimator (LSE) for fast identification
of parameters.


8.5.1 Off-Line Learning (Batch Learning) 

For simplicity, assume that the adaptive network under consideration has only one 
output represented by 

$$
o = F(\mathbf{i}, S), \tag{8.30}
$$

where i is the vector of input variables, S is the set of parameters, and F is the
overall function implemented by the adaptive network. If there exists a function H
such that the composite function $H \circ F$ is linear in some of the elements of S, then
these elements can be identified by the least-squares method. More formally, if the
parameter set S can be divided into two sets

$$
S = S_1 \oplus S_2 \tag{8.31}
$$

(where $\oplus$ represents direct sum) such that $H \circ F$ is linear in the elements of $S_2$,
then upon applying H to Equation (8.30), we have

$$
H(o) = H \circ F(\mathbf{i}, S), \tag{8.32}
$$

which is linear in the elements of $S_2$. Now given values of elements of $S_1$, we can
plug P training data into Equation (8.32) and obtain a matrix equation:

$$
A \theta = y, \tag{8.33}
$$

where $\theta$ is an unknown vector whose elements are parameters in $S_2$. Obviously
Equation (8.33) is exactly the same as Equation (5.14) in Section 5.3; thus this is a
standard linear least-squares problem, and the best solution for $\theta$, which minimizes
$\|A\theta - y\|^2$, is the least-squares estimator (LSE) $\theta^*$:

$$
\theta^* = (A^T A)^{-1} A^T y, \tag{8.34}
$$

where $A^T$ is the transpose of A and $(A^T A)^{-1} A^T$ is the pseudoinverse of A if $A^T A$
is nonsingular. Of course, we can also employ the recursive LSE formula introduced
in Section 5.5 to calculate $\theta^*$. Specifically, let the ith row vector of matrix A defined
in Equation (8.33) be $a_i^T$ and the ith element of y be $y_i^T$; then $\theta$ can be calculated
iteratively as follows:


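$$
\left.
\begin{aligned}
\theta_{i+1} &= \theta_i + P_{i+1} a_{i+1} \left( y_{i+1}^T - a_{i+1}^T \theta_i \right) \\
P_{i+1} &= P_i - \frac{P_i a_{i+1} a_{i+1}^T P_i}{1 + a_{i+1}^T P_i a_{i+1}}
\end{aligned}
\right\} \quad i = 0, 1, \ldots, P - 1, \tag{8.35}
$$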
where the least-squares estimator $\theta^*$ is equal to $\theta_P$. The initial conditions needed
to bootstrap Equation (8.35) are $\theta_0 = 0$ and $P_0 = \gamma I$, where $\gamma$ is a positive large
number and I is the identity matrix of dimension M × M. The effects of these
initial conditions on the identification of $\theta^*$ are described in Section 5.5. When we
are dealing with adaptive networks with multiple outputs [o in Equation (8.30) is a
column vector], Equation (8.35) still applies except that $y_i^T$ is the ith row of matrix y.
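A minimal sketch of the recursive least-squares estimator for a single-output problem follows, written with plain Python lists. The straight-line model and data are hypothetical, and `gamma` plays the role of the large positive number in the initial matrix.

```python
def rls(rows, targets, gamma=1e6):
    """Recursive least squares for A theta = y, one row of A at a time."""
    M = len(rows[0])
    theta = [0.0] * M                                   # theta_0 = 0
    P = [[gamma if r == c else 0.0 for c in range(M)]   # P_0 = gamma * I
         for r in range(M)]
    for a, y in zip(rows, targets):
        Pa = [sum(P[r][c] * a[c] for c in range(M)) for r in range(M)]
        denom = 1.0 + sum(a[r] * Pa[r] for r in range(M))
        # P is symmetric, so (P a)(P a)^T equals P a a^T P
        P = [[P[r][c] - Pa[r] * Pa[c] / denom for c in range(M)]
             for r in range(M)]
        gain = [sum(P[r][c] * a[c] for c in range(M)) for r in range(M)]
        err = y - sum(a[c] * theta[c] for c in range(M))
        theta = [theta[r] + gain[r] * err for r in range(M)]
    return theta

xs = [0.0, 1.0, 2.0, 3.0]
rows = [[x, 1.0] for x in xs]           # hypothetical model y = theta_1 * x + theta_2
targets = [2.0 * x + 1.0 for x in xs]   # data generated with theta = (2, 1)
theta = rls(rows, targets)
```

Because the initial matrix is finite, the recursive estimate differs slightly from the exact least-squares solution; the difference shrinks as `gamma` grows.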

Now we can combine steepest descent and the least-squares estimator to update
the parameters in an adaptive network. For hybrid learning to be applied in a batch
mode, each epoch is composed of a forward pass and a backward pass. In the
forward pass, after an input vector is presented, we calculate the node outputs in
the network layer by layer until a corresponding row in the matrices A and y in
Equation (8.33) is obtained. This process is repeated for all the training data pairs
to form the complete A and y; then parameters in $S_2$ are identified by either the
pseudoinverse formula in Equation (8.34) or the recursive least-squares formulas
in Equation (8.35). After the parameters in $S_2$ are identified, we can compute
the error measure for each training data pair. In the backward pass, the error
signals [the derivative of the error measure with respect to each node output, see
Equations (8.4) and (8.5)] propagate from the output end toward the input end;
the gradient vector is accumulated for each training data entry. At the end of the
backward pass for all training data, the parameters in $S_1$ are updated by steepest
descent in Equation (8.9).
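One epoch of the batch hybrid rule can be sketched on a deliberately small model. The sketch assumes a hypothetical network output o = θ₁·exp(−s·x) + θ₂, in which s is the only nonlinear parameter (the set S₁) and θ₁, θ₂ are the linear parameters (the set S₂). The model, data, learning rate, and function names are illustrative assumptions.

```python
import math

def lse(phis, ds):
    """Least-squares fit of d = theta1 * phi + theta2 via the normal equations."""
    n = len(phis)
    s_pp = sum(p * p for p in phis)
    s_p = sum(phis)
    s_pd = sum(p * d for p, d in zip(phis, ds))
    s_d = sum(ds)
    det = s_pp * n - s_p * s_p
    return (s_pd * n - s_p * s_d) / det, (s_pp * s_d - s_p * s_pd) / det

def hybrid_train(xs, ds, s=0.5, eta=0.01, epochs=2000):
    errors = []
    for _ in range(epochs):
        # forward pass: with s fixed, identify the linear parameters by LSE
        phis = [math.exp(-s * x) for x in xs]
        t1, t2 = lse(phis, ds)
        resid = [d - (t1 * p + t2) for d, p in zip(ds, phis)]
        errors.append(sum(r * r for r in resid))
        # backward pass: accumulate dE/ds over all pairs, then one SD step
        grad = sum(2.0 * r * t1 * x * p for r, x, p in zip(resid, xs, phis))
        s -= eta * grad
    return s, (t1, t2), errors

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ds = [2.0 * math.exp(-1.5 * x) + 0.5 for x in xs]   # hypothetical target data
s_final, thetas, errors = hybrid_train(xs, ds)
```

Only the single parameter s is searched by steepest descent; the two linear parameters are recomputed exactly at every epoch, which is the reduction in search-space dimension that the hybrid rule relies on.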

For given fixed values of the parameters in $S_1$, the parameters in $S_2$ thus found
are guaranteed to be the global optimum point in the $S_2$ parameter space because
of the choice of the squared error measure. Not only can this hybrid learning rule
decrease the dimension of the search space explored by the original steepest descent
method, but, in general, it will also substantially reduce the time needed to reach
convergence. However, sometimes the parameters in $S_2$ are concealed and we need
a certain transformation method to recover them. In the following example, we use 
a multilayer perceptron with one hidden layer to explain how the linear parameters 
may be recovered. 

Example 8.8 Recovery of linear parameters in a multilayer perceptron 

For a single-hidden-layer perceptron with p output units, the o in Equation (8.30) 
is a column vector. If these output units are linear and there are no squashing 
functions to limit the output range, then the outputs are linear in the weights 
and thresholds of the output layer and these parameters can be identified by the 
least-squares method. On the other hand, if these output units have a sigmoidal 
activation function, then we can apply the inverse sigmoid function 

$$
H(x) = \ln\left( \frac{x}{1 - x} \right), \tag{8.36}
$$

such that the H(o) in Equation (8.32) becomes a linear (vector) function in the
parameters (weights and thresholds) of the output layer. In other words,

  $S_1$ = weights and thresholds of hidden layer,
  $S_2$ = weights and thresholds of output layer.

Therefore, we can apply backpropagation (or equivalently, steepest descent) to tune 
the parameters in the hidden layer, and the parameters in the output layer can be 
identified by the least-squares method. 


□ 
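The effect of the inverse sigmoid transformation can be seen on a single hypothetical output unit o = sigmoid(w·h + b), where h stands for a hidden-layer output. The weight, threshold, and sample values below are assumptions for illustration.

```python
import math

def inverse_sigmoid(o):
    """H(o) = ln(o / (1 - o)), defined for 0 < o < 1."""
    return math.log(o / (1.0 - o))

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical output unit with weight w and bias b
w, b = 1.2, -0.4
hs = [-1.0, 0.0, 0.5, 2.0]
outputs = [sigmoid(w * h + b) for h in hs]
# After the transformation the targets are linear in w and b
transformed = [inverse_sigmoid(o) for o in outputs]
```

Since `transformed` equals w·h + b for every sample, the parameters w and b can be identified from (h, H(o)) pairs by ordinary linear least squares. Desired outputs of exactly 0 or 1 fall outside the domain of H and would have to be moved slightly inside the interval first.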

As mentioned in Section 5.8, however, it should be kept in mind that by using 
the least-squares method on the data transformed by H(·), the obtained parameters
are optimal in terms of the transformed squared error measure instead of the
original one. In practice, usually this does not cause a problem as long as H(·) is
monotonically increasing and the training data are not too noisy. See Section 5.8 
for a more detailed treatment of other transformation methods. 


8.5.2 On-Line Learning (Pattern-by-Pattern Learning)

If the parameters are updated after each data presentation, we have the scheme 
of on-line or pattern-by-pattern learning. This learning strategy is vital to on-line 
parameter identification for systems with changing characteristics. To modify the 
batch learning rule to obtain an on-line version, it is obvious that steepest descent 
should be based on $E_p$ [see Equation (8.6)] instead of E. Strictly speaking, this is
not a true gradient search procedure for minimizing E, yet it will approximate one
if the learning rate is small. 

For the recursive least-squares formula to account for the time-varying charac- 
teristics of the incoming data, the effects of old data pairs must decay as new data 
pairs become available. Again, this problem is well studied in the adaptive control 
and system identification literature, and a number of solutions are available [2]. 
One simple method is to formulate the squared error measure as a weighted version 
that gives higher weighting factors to more recent data pairs. This amounts to the 
addition of a forgetting factor $\lambda$ to the original recursive formula:

$$
\left.
\begin{aligned}
\theta_{i+1} &= \theta_i + P_{i+1} a_{i+1} \left( y_{i+1}^T - a_{i+1}^T \theta_i \right) \\
P_{i+1} &= \frac{1}{\lambda} \left[ P_i - \frac{P_i a_{i+1} a_{i+1}^T P_i}{\lambda + a_{i+1}^T P_i a_{i+1}} \right]
\end{aligned}
\right\} \tag{8.37}
$$

where the typical value of $\lambda$ in practice is between 0.9 and 1. The smaller $\lambda$ is,
the faster the effects of old data decay. A small $\lambda$ sometimes causes numerical
instability, however, and thus should be avoided. [For a complete discussion and
derivation of Equation (8.37), the reader is referred to Section 5.6 of Chapter 5.]
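The role of the forgetting factor can be illustrated with a scalar version of the recursion, assuming a hypothetical one-parameter model y = θ·x whose true parameter changes abruptly halfway through the data. The data and names are assumptions for illustration.

```python
def rls_forgetting(xs, ys, lam, gamma=1e6):
    """Scalar recursive least squares with forgetting factor lam
    for the hypothetical model y = theta * x."""
    theta, P = 0.0, gamma
    for x, y in zip(xs, ys):
        P = (P - P * x * x * P / (lam + x * P * x)) / lam
        theta += P * x * (y - x * theta)
    return theta

xs = [1.0] * 60
ys = [1.0] * 30 + [3.0] * 30            # the true parameter jumps from 1 to 3
theta_no_forgetting = rls_forgetting(xs, ys, lam=1.0)
theta_forgetting = rls_forgetting(xs, ys, lam=0.9)
```

With `lam = 1.0` all data pairs are weighted equally and the estimate ends near the average of the two regimes. With `lam = 0.9` the older pairs are discounted and the estimate ends close to the more recent value.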


8.5.3 Different Ways of Combining Steepest Descent and LSE 

The computational complexity of the least-squares estimator (LSE) is usually higher 
than that of steepest descent (SD) for one-step adaptation. However, for achieving 
a prescribed performance level, the LSE is usually much faster. Consequently, 
depending on the available computing resources and required level of performance,
we can choose from among at least five types of hybrid learning rules combining SD 
and LSE in different degrees, as follows: 

1. One pass of LSE only: Nonlinear parameters are fixed while linear parameters 
are identified by one-time application of LSE. 

2. SD only: All parameters are treated as nonlinear and updated by SD itera- 
tively. 

3. One pass of LSE followed by SD: LSE is employed only once at the beginning 
to obtain the initial values of linear parameters, and then SD takes over to 
update all parameters iteratively. 

4. SD and LSE: Linear and nonlinear parameters are distinguished first. Each 
iteration (epoch) of SD used to update the nonlinear parameters is followed 
by LSE to identify the linear parameters. 

5. LSE only: The outputs of an adaptive network are linearized with respect to 
its parameters, and then the extended Kalman filter algorithm, the Gauss- 
Newton method, or the Levenberg-Marquardt method is employed to update 
all parameters. These methods have also been used in the neural network 
literature [12, 13, 14]. (See also Section 6.8 of Chapter 6.) 

The choice of one of the foregoing methods should be based on a trade-off between
computational complexity and performance. Note that the linear parameters
can also be updated by the Widrow-Hoff LMS algorithm [17], as reported in
ref. [15]. The Widrow-Hoff algorithm requires less computation and is suitable for
parallel hardware implementation, but it converges relatively slowly compared with 
the least-squares estimator. See also Section 9.3 of Chapter 9. 

8.6 SUMMARY 

This chapter describes the architectures and learning procedures of adaptive net- 
works, a unifying framework that subsumes all supervised learning neural networks 
(Chapter 9), such as perceptrons, Adalines, multilayer perceptrons, radial basis 
function networks and modular networks. Understanding adaptive networks also 
paves the avenue to neuro-fuzzy modeling paradigms, such as ANFIS and CANFIS, 
which are presented in Chapters 12 and 13, respectively. 


EXERCISES 

1. Finish Example 8.6 by giving the expressions of the error signals for nodes 1,
2, 3, 4, and 5.

2. Figure 8.18(a) is a simple adaptive network in which the error measure E is
defined as a function of $x_2$ and $x_3$. (a) Give the formulas for each $\epsilon_i$. (b) Draw
the error-propagation network.

3. In the previous exercise, suppose that E is expressed as $E = f_4(x_2, x_3)$ and
it is added as a new node to form the new network shown in Figure 8.18(b).
Draw the corresponding error-propagation network. Is it the same as what you
obtained in the previous exercise?

Figure 8.18. Adaptive networks used in the exercises.

4. Verify that the $\Delta$ in Equation (8.26) is equal to the determinant of the square
matrix in Equation (8.22).

5. Apply Mason’s gain formula to express $\epsilon_4$, $\epsilon_5$, and $\epsilon_6$ in terms of $I_1$ and $I_2$ in
Example 8.7.

6. Solve Equation (8.22) algebraically and compare the answer with the formulas
obtained via Mason’s gain formula in the previous exercise.
Neural Networks


Chapter 9 


J.-S. R. Jang and E. Mizutani 


9.1 INTRODUCTION 

Artificial neural networks, or simply neural networks (NNs), have been 
studied for more than three decades since Rosenblatt [47] first applied single-layer 
perceptrons to pattern classification learning in the late 1950s. However, because 
Minsky and Papert [33] pointed out that single-layer systems were limited and 
expressed pessimism over multilayer systems, interest in NNs dwindled in the 1970s. 
The recent resurgence of interest in the field of NNs has been inspired by new 
developments in NN learning algorithms [10, 40, 48, 59], analog VLSI (very large 
scale integrated) circuits, and parallel processing techniques [25]. 

Quite a few NN models have been proposed and investigated in recent years. 
These NN models can be classified according to various criteria, such as their learn- 
ing methods (supervised versus unsupervised), architectures (feedforward versus 
recurrent), output types (binary versus continuous), node types (uniform versus hy- 
brid), implementations (software versus hardware), connection weights (adjustable 
versus hardwired), operations (biologically motivated versus psychologically moti- 
vated), and so on. In this chapter, we confine our scope to modeling problems with 
desired input-output data sets, so the resulting networks must have adjustable pa- 
rameters that are updated by a supervised learning rule. Such networks are often 
referred to as supervised learning or mapping networks, since we are inter- 
ested in shaping the input-output mappings of the networks according to a given 
training data set. (For details on unsupervised learning networks that try to cluster 
a given data set, see the next chapter.) 
Figure 9.1. The perceptron. 

Figure 9.2. Introduction of the bias connection weight; the bias term $w_0$ ($= -\theta$) 
can be viewed as the connection weight between the output unit and a “dummy” 
incoming signal $x_0$ that is always equal to 1. 


9.2 PERCEPTRONS 

9.2.1 Architecture and Learning Rule 

The perceptron represents one of the early attempts to build intelligent and self- 
learning systems using simple components. It was derived from a biological brain 
neuron model introduced by McCulloch and Pitts [32] in 1943. Later, Rosen- 
blatt [47] designed the perceptron with a view toward explaining and modeling 
pattern-recognition abilities of biological visual systems. Although the goal is am- 
bitious, the network paradigm is simple. Figure 9.1 is a typical perceptron setup 
for pattern-recognition applications, in which visual patterns are represented as 
matrices of elements between 0 and 1. The first layer of the perceptron acts as a 
set of “feature detectors” that are hardwired to the input signals to detect specific 
features. The second (output) layer takes the outputs of the “feature detectors” 
in the first layer and classifies the given input pattern. Learning is carried out by 
making adjustments to the relevant connection strengths (i.e., weights $w_i$) and a 
threshold value $\theta$. For a two-class problem (for instance, determining whether the 
given pattern in Figure 9.1 is a “P” or not), the output layer usually has only a 
single node. For an n-class problem with n greater than or equal to 3, the output 
layer usually has n nodes, each corresponding to a class, and the output node with 
the largest value indicates which class the input vector belongs to. 

Each function $g_i$ in layer 1 is a fixed function that has to be determined a priori; 
it maps all or a part of the input pattern into a binary value $x_i \in \{0, 1\}$ or a 
bipolar value $x_i \in \{-1, 1\}$. The term $x_i$ is referred to as active or excitatory if its 
value is 1, inactive if its value is 0, and inhibitory if its value is $-1$. The output 
unit is a linear threshold element with a threshold value $\theta$: 

$$
\begin{aligned}
o &= f\Big(\sum_{i=1}^{n} w_i x_i - \theta\Big) \\
  &= f\Big(\sum_{i=1}^{n} w_i x_i + w_0\Big), \qquad w_0 \equiv -\theta, \\
  &= f\Big(\sum_{i=0}^{n} w_i x_i\Big), \qquad x_0 \equiv 1,
\end{aligned}
\tag{9.1}
$$


where $w_i$ is a modifiable weight associated with an incoming signal $x_i$, and $w_0$ 
($= -\theta$) is the bias term. For computational efficiency, we can introduce the bias 
connection weight $w_0$ in place of the threshold value $\theta$; Equation (9.1) shows that the 
threshold can be viewed as the connection weight between the output unit and a 
“dummy” incoming signal $x_0$ that is always equal to 1, as illustrated in Figure 9.2. 
In Equation (9.1), $f(\cdot)$ is the activation function of the perceptron and it is 
typically either a signum function sgn(x) or step function step(x): 

$$
\mathrm{sgn}(x) =
\begin{cases}
1 & \text{if } x > 0, \\
-1 & \text{otherwise,}
\end{cases}
\qquad
\mathrm{step}(x) =
\begin{cases}
1 & \text{if } x > 0, \\
0 & \text{otherwise.}
\end{cases}
$$


Note that the “feature detector” gi can be any function of the input pattern, 
but the learning procedure only adjusts the connection weights to the output unit 
(in the last layer). Since only the weights leading to the last layer are modifiable, 
the perceptron in Figure 9.1 is usually treated as a single-layer perceptron. 
Starting with a set of random connection weights, the basic learning algorithm for 
a single-layer perceptron repeats the following steps until the weights converge: 

1. Select an input vector x from the training data set. 

2. If the perceptron gives an incorrect response, modify all connection weights 
$w_i$ according to 

$$ \Delta w_i = \eta t_i x_i, $$

where $t_i$ is a target output and $\eta$ is a learning rate. 
The foregoing learning rule can be applied as well to updating a threshold $\theta$ ($= -w_0$) 
according to Equation (9.1). The value for the learning rate $\eta$ can be a 
constant throughout training, or it can be a varying quantity proportional to the 
error. An $\eta$ that is proportional to the error usually leads to faster convergence but 
can cause unstable learning. 
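
The two learning steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the targets and features are bipolar, the activation is the signum function, the learning rate is constant, and the weights start at zero rather than at random values so that the run is reproducible. The training set (a bipolar AND) is made up for the example.

```python
def sgn(x):
    """Signum activation: 1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1


def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Train a single-layer perceptron with the rule  delta w_i = eta * t * x_i.

    samples: list of (x, t) pairs, x a list of bipolar features, t in {-1, 1}.
    Returns [w_0, w_1, ..., w_n]; w_0 is the bias weight attached to the
    dummy input x_0 = 1.
    """
    n = len(samples[0][0])
    w = [0.0] * (n + 1)      # assumption: zero start instead of random weights
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            xa = [1.0] + list(x)                      # prepend x_0 = 1
            o = sgn(sum(wi * xi for wi, xi in zip(w, xa)))
            if o != t:                                # update only on a wrong response
                w = [wi + eta * t * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                               # a full pass with no mistakes
            break
    return w


def predict(w, x):
    """Perceptron response for input x using weights w (w[0] is the bias)."""
    return sgn(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


# Bipolar AND: linearly separable, so the rule is expected to converge.
and_data = [([-1, -1], -1), ([-1, 1], -1), ([1, -1], -1), ([1, 1], 1)]
w_and = train_perceptron(and_data)
```

Because the bias is treated as the weight on a constant input, the same update line handles the threshold as well.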

The preceding learning algorithm is roughly based on gradient descent; 
Rosenblatt [47] proved that there exists a method for tuning the weights that is 
guaranteed to converge to provide the required output if and only if such a set of 
weights exists. This is called the perceptron convergence theorem. Moreover, 
depending on the functions $g_i$, perceptrons can be grouped into different families; 
a number of these families and their properties are described in ref. [47]. 

In the early 1960s, perceptrons created a great deal of interest and optimism 
directed toward building real self-learning intelligent systems. However, the initial 
enthusiasm waned after the publication of Minsky and Papert’s Perceptrons [33] in 
1969, in which they analyzed the perceptron extensively and concluded that single- 
layer perceptrons can only be used for toy problems. One of their most discouraging 
results shows that a single-layer perceptron cannot represent a simple exclusive-OR 
function, as explained next. 


9.2.2 Exclusive-OR Problem 


The simplest and best-known pattern recognition problem in the neural network 
literature is the exclusive-OR (XOR) problem. The task is to classify a binary 
input vector to class 0 if the vector has an even number of 1’s, and to class 1 otherwise. 
For a two-input binary XOR problem, the desired behavior is specified by a truth 
table: 



                      X    Y    Class 
Desired i/o pair 1    0    0    0 
Desired i/o pair 2    0    1    1 
Desired i/o pair 3    1    0    1 
Desired i/o pair 4    1    1    0 


A bipolar XOR problem is similarly defined except that all instances of 0 in the 
truth table are replaced with $-1$. 

The XOR problem is not linearly separable; this can easily be observed from 
the plot in Figure 9.3. In other words, we cannot use a single-layer perceptron 
[Figure 9.4(a)] to construct a straight line to partition the two-dimensional input 
space into two regions, each containing only data points of the same class. Symbol- 
ically, using a single-layer perceptron to solve this problem requires satisfying the 
following four inequalities: 


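$$
\begin{aligned}
0 \times w_1 + 0 \times w_2 + w_0 \le 0 \quad &\Longleftrightarrow \quad w_0 \le 0, \\
0 \times w_1 + 1 \times w_2 + w_0 > 0 \quad &\Longleftrightarrow \quad w_0 > -w_2, \\
1 \times w_1 + 0 \times w_2 + w_0 > 0 \quad &\Longleftrightarrow \quad w_0 > -w_1, \\
1 \times w_1 + 1 \times w_2 + w_0 \le 0 \quad &\Longleftrightarrow \quad w_0 \le -w_1 - w_2.
\end{aligned}
$$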
Figure 9.3. XOR problem (o: class 1, x: class 2). (MATLAB file: xordata.m) 

Figure 9.4. Perceptrons for the two-input exclusive-OR problem: (a) the single- 
layer perceptron, and (b) the two-layer perceptron. Both use the step function as 
the activation function for each node. 


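However, the set of inequalities is self-contradictory when considered as a whole: 
adding the second and third inequalities gives $2w_0 > -w_1 - w_2$, which combined 
with the fourth yields $2w_0 > w_0$, that is, $w_0 > 0$, contradicting the first. 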

It is possible to solve the problem with the two-layer perceptron illustrated in 
Figure 9.4(b), in which the connection weights and thresholds are indicated. More 
specifically, we can plot the output of each neuron as a surface of its two inputs, as 
shown in Figures 9.5(a) through 9.5(d). Figure 9.5(d) is the overall input-output 
plot of the two-layer perceptron, indicating that it has solved the XOR problem. 
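
As a quick check of the idea, the following sketch evaluates a two-layer perceptron with step activations on all four XOR input pairs. The weights and thresholds used here are one illustrative choice and are not necessarily those indicated in Figure 9.4(b); the node names $x_3$, $x_4$, and $x_5$ follow the labels used in Figure 9.5.

```python
def step(x):
    """Step activation: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0


def xor_two_layer(x1, x2):
    # Hidden layer (weights and thresholds are illustrative assumptions).
    x3 = step(x1 + x2 - 0.5)     # fires when at least one input is 1 (OR)
    x4 = step(x1 + x2 - 1.5)     # fires only when both inputs are 1 (AND)
    # Output layer: active when x3 fires but x4 does not.
    x5 = step(x3 - x4 - 0.5)
    return x5


truth_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
outputs = {xy: xor_two_layer(*xy) for xy in truth_table}
```

Each hidden node draws one straight line in the input plane; the output node then combines the two half-planes, which is what a single linear threshold unit cannot do.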

In summary, multilayer perceptrons can solve nonlinearly separable problems 
and are thus much more powerful than the original single-layer version. In Sec- 
tion 9.4, we discuss a learning method that can be used to find appropriate connec- 
tion weights and thresholds in multilayer perceptrons. 


9.3 ADALINE 

The adaptive linear element (or Adaline), suggested by Widrow and Hoff [62], 
represents a classical example of the simplest intelligent self-learning system that 
can adapt itself to achieve a given modeling task. Figure 9.6 is a schematic diagram 
for such a network. It has a purely linear output unit; hence the network output $o$ 
is a weighted linear combination of the inputs plus a constant term: 

$$ o = \sum_{i=1}^{n} w_i x_i + w_0. \tag{9.2} $$

Figure 9.5. Node outputs as surfaces of their inputs in Figure 9.4: (a) $x_3$; (b) $x_4$; 
(c) $x_5$ (as a function of $x_3$ and $x_4$); (d) $x_5$ (as a function of $x_1$ and $x_2$). (MATLAB 
file: xorsurfl.m) 

In a simple physical implementation, the input signals Xi are voltages and the Wi 
are conductances of controllable resistors; the network’s output is the summation 
of the currents caused by the input voltages. The problem is to find a suitable set 
of conductances (or weights) such that the input-output behavior of the Adaline is 
close to a set of desired input-output data points. 

It is obvious that the preceding Adaline equation is an exactly linear model with 
$n + 1$ linear parameters, so we can employ the least-squares methods introduced 
in Chapter 5 to minimize the error in the sense of least squares. However, most 
of the least-squares methods require extensive calculations, which are not possible 
in a physical system with simple components. To overcome this, Widrow and Hoff 
introduced the delta rule for adjusting the weights. For the $p$th input-output 
pattern, the error measure of a single-output Adaline can be expressed as 

$$ E_p = (t_p - o_p)^2, $$
where $t_p$ is the target output and $o_p$ is the actual output of the Adaline. The 
derivative of $E_p$ with respect to each weight $w_i$ is 

$$ \frac{\partial E_p}{\partial w_i} = -2 (t_p - o_p) x_i. $$

Therefore, to decrease $E_p$ by gradient descent, the update formula for $w_i$ on the 
$p$th input-output pattern is 

$$ \Delta_p w_i = \eta (t_p - o_p) x_i. \tag{9.3} $$

This update formula has strong intuitive appeal. It essentially states that when 
$t_p > o_p$, we want to boost $o_p$ by increasing $w_i x_i$; therefore, we should increase $w_i$ 
if $x_i$ is positive and decrease $w_i$ if $x_i$ is negative. Similar reasoning holds when 
$t_p < o_p$. Since the delta rule tries to minimize squared errors, it is also referred 
to as the least mean square (LMS) learning procedure or Widrow-Hoff 
learning rule. The features of the delta rule are as follows: 

• Simplicity: This is obvious from Equation (9.3). 

• Distributed learning: Learning is not reliant on central control of the network; 
it can be performed locally at each node level; see Equation (9.3). 

• On-line (or pattern-by-pattern) learning: Weights are updated after presen- 
tation of each pattern. 

These features make Adaline, with the delta rule, suitable for simple hardware 
implementation. 
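
A minimal sketch of the delta rule of Equation (9.3) in on-line mode is given below. The function names, the learning rate, the number of epochs, and the noise-free training data (generated from $t = 2x_1 - 3x_2 + 1$) are all assumptions made for illustration; with consistent data of this kind the weights should approach the generating coefficients.

```python
def adaline_output(w, x):
    """Equation (9.2): o = sum_i w_i x_i + w_0, with w[0] as the constant term."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def lms_train(data, eta=0.1, epochs=500):
    """On-line delta rule: after each pattern, w_i += eta * (t_p - o_p) * x_i."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in data:
            err = t - adaline_output(w, x)
            w[0] += eta * err                  # the bias sees the constant input 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * err * xi
    return w


# Hypothetical noise-free data generated by t = 2*x1 - 3*x2 + 1.
inputs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
data = [(x, 2 * x[0] - 3 * x[1] + 1) for x in inputs]
w_lms = lms_train(data)      # expected to approach [1, 2, -3]
```

Note that each weight update uses only the shared error and that weight's own input, which is the locality property listed above.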

In the 1960s, two or more Adaline components were integrated to develop Madaline 
(many Adalines) in an attempt to implement nonlinearly separable logic functions. 
To solve the previously mentioned XOR problem, for instance, two Adalines 
were connected to an AND logic device (Madaline unit) to provide an output [65]. 
However, the Adaline and Madaline systems were limited in that they had only one 
layer with adjustable weights, just like single-layer perceptrons. 

Adaline and Madaline have been used for adaptive noise cancellation [61] and 
adaptive inverse control [64]. In adaptive noise cancellation, the objective is to 
filter out an interference component by identifying a linear model of a measurable 
noise source and the corresponding unmeasurable interference; such applications 
include interference canceling in electrocardiograms (ECGs), echo elimination from 
long-distance telephone transmission lines, and antenna sidelobe interference 
canceling [64]. For more details on adaptive inverse control, see ref. [64] or Section 17.4 
of this text. For details on Adaline, Madaline, and LMS methods, refer to refs. [63] 
and [64]. 

9.4 BACKPROPAGATION MULTILAYER PERCEPTRONS 

As mentioned earlier, the lack of suitable training methods for multilayer 
perceptrons (MLPs) led to a waning of interest in neural networks in the 1960s and 
1970s. This did not change until the reformulation of the backpropagation 
training method for MLPs in the mid-1980s by Rumelhart et al. [48]. (The derivation 
of the backpropagation method in a more general framework can be found in 
Section 8.3 of Chapter 8.) 

The single-layer perceptron discussed in Section 9.2 is a principal NN component 
and provides the grounds for current understanding and most applications of NNs. 
However, because the hard-limiter activation function is not differentiable, 
learning strategies for multilayer perceptrons with signum or step activation 
functions are not obvious; gradient-based training becomes straightforward once 
continuous activation functions are employed. 

A backpropagation MLP, as already mentioned in Examples 8.3 and 8.4, is an 
adaptive network whose nodes (or neurons) perform the same function on incom- 
ing signals; this node function is usually a composite of the weighted sum and a 
differentiable nonlinear activation function, also known as the transfer function. 
Figure 9.7 depicts three of the most commonly used activation functions in back- 
propagation MLPs: 

Logistic function: $f(x) = \dfrac{1}{1 + e^{-x}}$ 

Hyperbolic tangent function: $f(x) = \tanh(x/2) = \dfrac{1 - e^{-x}}{1 + e^{-x}}$ 

Identity function: $f(x) = x$. 

Both the hyperbolic tangent and logistic functions approximate the signum and 
step functions, respectively, and yet provide smooth, nonzero derivatives with 
respect to input signals. Sometimes these two activation functions are referred to as 
squashing functions since the inputs to these functions are squashed to the range 
[0, 1] or [-1, 1]. They are also called sigmoidal functions because their S-shaped 
curves exhibit smoothness and asymptotic properties. (Sometimes the hyperbolic 
tangent function is referred to as bipolar sigmoidal and the logistic function is 
referred to as binary sigmoidal.) Both of these activation functions are often used in 
regression and classification problems. Other activation functions are discussed in 
Section 13.3.3 of Chapter 13. 

Figure 9.7. Activation functions for backpropagation MLPs: (a) logistic function; 
(b) hyperbolic tangent function; (c) identity function. (MATLAB file: activati.m) 
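
The two sigmoidal functions are closely related: $\tanh(x/2) = 2/(1 + e^{-x}) - 1$, so the hyperbolic tangent form used here is the logistic function rescaled from (0, 1) to (-1, 1). The sketch below (function names are hypothetical) checks this numerically and also writes each derivative in terms of the function's own output, the form that backpropagation uses.

```python
import math


def logistic(x):
    """Logistic function, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_half(x):
    """Hyperbolic tangent form used in the text: tanh(x/2), range (-1, 1)."""
    return math.tanh(x / 2.0)


def logistic_deriv(x):
    """Derivative of the logistic function, written via its own output."""
    y = logistic(x)
    return y * (1.0 - y)


def tanh_half_deriv(x):
    """Derivative of tanh(x/2), also written via its own output."""
    y = tanh_half(x)
    return 0.5 * (1.0 - y * y)


# Largest deviation from the identity tanh(x/2) = 2*logistic(x) - 1 on a few points.
max_gap = max(abs(tanh_half(x) - (2.0 * logistic(x) - 1.0))
              for x in (-4.0, -1.0, 0.0, 1.0, 4.0))
```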

For a neural network to approximate a continuous-valued function not limited 
to the interval [0, 1] or [—1, 1], we usually let the node function for the output layer 
be a weighted sum with no squashing functions. This is equivalent to a situation in 
which the activation function is an identity function, and output nodes of this type 
are often called linear nodes. 

Backpropagation MLPs are by far the most commonly used NN structures for 
applications in a wide range of areas, such as pattern recognition, signal processing, 
data compression, and automatic control. Some of the well-known instances of 
applications include NETtalk [51, 52], which trained an MLP to pronounce English 
text; Carnegie Mellon University’s ALVINN (Autonomous Land Vehicle in a Neural 
Network) [42, 43], which used an MLP for steering an autonomous vehicle; and 
optical character recognition (OCR) [23, 49]. 


9.4.1 Backpropagation Learning Rule 

For simplicity, we assume that the backpropagation MLP in question uses the lo- 
gistic function as its activation function; the reader is encouraged to derive the 
backpropagation procedure when other types of continuous activation functions are 
used. 

The net input $\bar{x}$ of a node is defined as the weighted sum of the incoming signals 
plus a bias term. For instance, the net input and output of node $j$ in Figure 9.8 are 

$$
\begin{aligned}
\bar{x}_j &= \sum_i w_{ij} x_i + w_j, \\
x_j &= f(\bar{x}_j) = \frac{1}{1 + \exp(-\bar{x}_j)},
\end{aligned}
\tag{9.4}
$$


where $x_i$ is the output of node $i$ located in any one of the previous layers, $w_{ij}$ is 
the weight associated with the link connecting nodes $i$ and $j$, and $w_j$ is the bias 
of node $j$. Since the weights $w_{ij}$ are actually internal parameters associated with 
each node $j$, changing the weights of a node will alter the behavior of the node and 
in turn alter the behavior of the whole backpropagation MLP. Figure 9.9 shows a 
two-layer backpropagation MLP with three inputs to the input layer, three neurons 
in the hidden layer, and two output neurons in the output layer. For simplicity, 
this backpropagation MLP will be referred to as a 3-3-2 network, corresponding to 
the number of nodes in each layer. (Note that the input layer is composed of three 
buffer nodes for distributing the input signals; therefore, this layer is conventionally 
not counted as a physical layer of a backpropagation MLP.) 

Figure 9.9. A 3-3-2 backpropagation MLP, with layer 0 (input layer), layer 1 
(hidden layer), and layer 2 (output layer). 

The backward error propagation, also known as the backpropagation 
(BP) or the generalized delta rule (GDR), is explained next. First, a squared 
error measure for the pth input-output pair is defined as 


$$ E_p = \sum_k (d_k - x_k)^2, \tag{9.5} $$
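where $d_k$ is the desired output for node $k$, and $x_k$ is the actual output for node $k$ 
when the input part of the $p$th data pair is presented. To find the gradient vector, 
an error term $\bar{\epsilon}_i$ for node $i$ is defined as 

$$ \bar{\epsilon}_i = \frac{\partial^+ E_p}{\partial \bar{x}_i}, \tag{9.6} $$

where $\bar{x}_i$ is the net input of node $i$. By the chain rule, the recursive formula 
for $\bar{\epsilon}_i$ can be written as 

$$
\bar{\epsilon}_i =
\begin{cases}
-2 (d_i - x_i) \dfrac{\partial x_i}{\partial \bar{x}_i} = -2 (d_i - x_i)\, x_i (1 - x_i)
  & \text{if node } i \text{ is an output node,} \\[2ex]
\dfrac{\partial x_i}{\partial \bar{x}_i} \displaystyle\sum_{j,\, i<j} \dfrac{\partial^+ E_p}{\partial \bar{x}_j} \dfrac{\partial \bar{x}_j}{\partial x_i}
  = x_i (1 - x_i) \displaystyle\sum_{j,\, i<j} \bar{\epsilon}_j w_{ij}
  & \text{otherwise,}
\end{cases}
\tag{9.7}
$$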

where $w_{ij}$ is the connection weight from node $i$ to $j$; $w_{ij}$ is zero if there is no 
direct connection. Then the update of the weight $w_{ki}$ for on-line (pattern-by-pattern) 
learning is 

$$
\Delta w_{ki} = -\eta \frac{\partial^+ E_p}{\partial w_{ki}}
= -\eta \frac{\partial^+ E_p}{\partial \bar{x}_i} \frac{\partial \bar{x}_i}{\partial w_{ki}}
= -\eta\, \bar{\epsilon}_i x_k,
\tag{9.8}
$$

where $\eta$ is a learning rate that affects the convergence speed and stability of the 
weights during learning. The update formula for the bias of each node can be 
derived similarly. 

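For off-line (batch) learning, the connection weight $w_{ki}$ is updated only after 
presentation of the entire data set, or only after an epoch: 

$$ \Delta w_{ki} = -\eta \frac{\partial^+ E}{\partial w_{ki}} = -\eta \sum_p \frac{\partial^+ E_p}{\partial w_{ki}}, \tag{9.9} $$

or, in vector form, 

$$ \Delta \mathbf{w} = -\eta \frac{\partial^+ E}{\partial \mathbf{w}} = -\eta \nabla_{\mathbf{w}} E, \tag{9.10} $$

where $E = \sum_p E_p$. This corresponds to using the true gradient direction 
based on the entire data set. 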
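
The following sketch applies Equations (9.5) through (9.8) to a small MLP with one hidden layer and logistic activations. It is an illustrative sketch rather than a reference implementation: the function names, the 2-2-1 network size, the initial weights, the training pair, and the learning rate are all assumptions made for the example.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(W1, b1, W2, b2, x):
    """Return hidden and output activations of a one-hidden-layer MLP.

    W1[j][i] is the weight from input i to hidden node j; W2[k][j] is the
    weight from hidden node j to output node k; b1 and b2 are the biases.
    """
    h = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = [logistic(sum(w * hj for w, hj in zip(row, h)) + b)
         for row, b in zip(W2, b2)]
    return h, y


def error(W1, b1, W2, b2, x, d):
    """Squared error of Equation (9.5) for one input-output pair."""
    _, y = forward(W1, b1, W2, b2, x)
    return sum((dk - yk) ** 2 for dk, yk in zip(d, y))


def backprop(W1, b1, W2, b2, x, d):
    """Gradients of E_p with respect to all weights and biases."""
    h, y = forward(W1, b1, W2, b2, x)
    # Output nodes: eps_k = -2 (d_k - x_k) x_k (1 - x_k).
    eps_out = [-2.0 * (dk - yk) * yk * (1.0 - yk) for dk, yk in zip(d, y)]
    # Hidden nodes: eps_j = x_j (1 - x_j) * sum_k eps_k w_jk.
    eps_hid = [hj * (1.0 - hj) *
               sum(eps_out[k] * W2[k][j] for k in range(len(W2)))
               for j, hj in enumerate(h)]
    # dE/dw = (error term of receiving node) * (output of sending node).
    gW2 = [[ek * hj for hj in h] for ek in eps_out]
    gW1 = [[ej * xi for xi in x] for ej in eps_hid]
    return gW1, eps_hid, gW2, eps_out     # bias gradients equal the error terms


def online_step(W1, b1, W2, b2, x, d, eta):
    """One pattern-by-pattern update, delta w = -eta * dE_p/dw (in place)."""
    gW1, gb1, gW2, gb2 = backprop(W1, b1, W2, b2, x, d)
    for W, g in ((W1, gW1), (W2, gW2)):
        for row, grow in zip(W, g):
            for i, gi in enumerate(grow):
                row[i] -= eta * gi
    for b, g in ((b1, gb1), (b2, gb2)):
        for i, gi in enumerate(g):
            b[i] -= eta * gi


# Hypothetical 2-2-1 network and a single training pair.
W1 = [[0.5, -0.4], [0.3, 0.8]]
b1 = [0.1, -0.2]
W2 = [[0.7, -0.6]]
b2 = [0.05]
x, d = [1.0, 0.5], [1.0]

e_before = error(W1, b1, W2, b2, x, d)
online_step(W1, b1, W2, b2, x, d, eta=0.5)
e_after = error(W1, b1, W2, b2, x, d)
```

For batch learning, the same gradients would be summed over all training pairs before a single update is applied, as in Equation (9.9).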


9.4.2 Methods of Speeding Up MLP Training 

Quite a few ad hoc methods exist to speed up MLP backpropagation training. Some 
of them are applicable to general backpropagation gradient descent, while others 
are only effective for MLP backpropagation. 

One way to speed up off-line training is to use the so-called momentum term [48]: 

$$ \Delta \mathbf{w} = -\eta \nabla_{\mathbf{w}} E + \alpha \Delta \mathbf{w}_{\text{prev}}, \tag{9.11} $$

where $\Delta \mathbf{w}_{\text{prev}}$ is the previous update amount, and the momentum constant $\alpha$, in 
practice, is usually set to a value between 0.1 and 1. The addition of the momentum 
term smooths weight updating and tends to resist erratic weight changes 
due to gradient noise or high spatial frequencies in the error surface. However, the 
use of momentum terms does not always seem to speed up training; it is more or 
less application dependent [63]. 

Another useful technique is normalized weight updating: 

$$ \Delta \mathbf{w} = -\kappa \frac{\nabla_{\mathbf{w}} E}{\|\nabla_{\mathbf{w}} E\|}. \tag{9.12} $$

This causes the network’s weight vector to move the same Euclidean distance $\kappa$ in 
the weight space with each update, which allows control of the distance $\kappa$ based 
on the history of error measures. One of the adaptation strategies for varying the 
step size $\kappa$ is explained in Section 6.7.2 of Chapter 6. Other methods for speeding 
up MLP backpropagation training include the quick-propagation algorithm [9], the 
delta-bar-delta approach [14], the extended Kalman filter method [54, 55], 
second-order optimization [58], and the optimal filtering approach [53]. 
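
To make the momentum update of Equation (9.11) and the normalized update of Equation (9.12) concrete, the sketch below applies both to a made-up quadratic error surface $E(\mathbf{w}) = w_1^2 + 10 w_2^2$ that stands in for an MLP's error function. The function names and the values chosen for $\eta$, $\alpha$, and $\kappa$ are assumptions for illustration only.

```python
import math


def grad_E(w):
    """Gradient of a hypothetical error surface E(w) = w1^2 + 10 * w2^2."""
    return [2.0 * w[0], 20.0 * w[1]]


def momentum_update(w, prev_dw, eta=0.02, alpha=0.9):
    """Equation (9.11): dw = -eta * grad E + alpha * dw_prev."""
    g = grad_E(w)
    dw = [-eta * gi + alpha * pi for gi, pi in zip(g, prev_dw)]
    return [wi + di for wi, di in zip(w, dw)], dw


def normalized_update(w, kappa=0.05):
    """Equation (9.12): dw = -kappa * grad E / ||grad E||."""
    g = grad_E(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:                       # already at a stationary point
        return w, [0.0] * len(w)
    dw = [-kappa * gi / norm for gi in g]
    return [wi + di for wi, di in zip(w, dw)], dw


# Momentum: repeated updates from w = (1, 1) with no initial velocity.
w_m, dw_m = [1.0, 1.0], [0.0, 0.0]
for _ in range(300):
    w_m, dw_m = momentum_update(w_m, dw_m)

# Normalized update: the step length equals kappa whatever the gradient size.
w_n, dw_n = normalized_update([1.0, 1.0])
step_length = math.sqrt(sum(d * d for d in dw_n))
```

Since the normalized step never shrinks by itself near a minimum, it is normally paired with a rule that adapts $\kappa$ during training.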

Generally, an MLP with hyperbolic tangent functions can be trained more 
rapidly than one with logistic functions. This is only an empirical observation and 
it has exceptions (for instance, the encoding problem in ref. [9]), but it is advisable 
to try both types of MLP when encountering a new application. 

Sometimes it is desirable to do data scaling on the raw training data and 
then use the processed data to train the MLP. In output scaling, the range of 
target values is constrained to remain within the range of the sigmoidal activation 
function. For instance, for an MLP with hyperbolic tangent functions, the target 
values must be within, say, [—0.9, 0.9], instead of within the usual activation function 
range [—1, 1]. This prevents backpropagation from driving some of the connection 
weights to infinity and slowing down training. A similar approach using modified 
sigmoid functions is discussed in Section 13.3.3. In input scaling, the range of 
each input is scaled (usually linearly) to the range of the activation function used. 
This scaling allows the connection weights to have the same order of magnitude 
during training. 
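
A linear scaling of this kind can be sketched as follows. The helper name and the raw data values are hypothetical; the target range [-0.9, 0.9] follows the hyperbolic tangent example above, and the input range [-1, 1] is the range of that activation function.

```python
def linear_scale(values, lo, hi):
    """Map values linearly so that min(values) -> lo and max(values) -> hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:                          # constant column: use the midpoint
        return [(lo + hi) / 2.0 for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]


raw_targets = [12.0, 20.0, 15.5, 30.0]        # hypothetical raw target values
raw_input = [100.0, 250.0, 175.0, 400.0]      # hypothetical raw values of one input

scaled_targets = linear_scale(raw_targets, -0.9, 0.9)   # output scaling
scaled_input = linear_scale(raw_input, -1.0, 1.0)       # input scaling
```

The same minimum and maximum would have to be stored and reused to scale new inputs and to map network outputs back to the original units.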

The initial values of the connection weights and biases in an MLP should be 
uniformly distributed across a small range, usually [—1,1]. If the initial values 
of these modifiable parameters are too large, then some of the neurons might get 
saturated and produce small error signals. On the other hand, if the range is too 
small, then the gradient vector is also small and learning will be very slow initially. 
Note that when all the free parameters are zeros, the gradient vector is always zero 
since it happens to be a saddle point in the error landscape. Another and better 
way to initialize network parameters is to choose the weight and bias values such 
that the “slope” parts of neurons in the hidden layer can cover the input space; see 
ref. [37] for details. 

All neurons in an MLP should get updated at approximately the same rate. 
However, the error signals at the output layer tend to be larger than those at 
the front-end layers of the network. This can be seen directly in Equation (9.7), 
where $x_i(1 - x_i)$ appears once in the output layer, twice in the layer next to the 
output, and so on. Note that the term $x_i(1 - x_i)$ is always less than or equal to 
0.25 (see dashed lines in Figure 13.8), so the error signals at front-end layers tend 
to be smaller due to repeated multiplication by this term. Therefore, the learning 
rate $\eta$ of the front-end layers should be larger than that of the output layer. This 
is called learning rate rescaling; see ref. [46] for details. 
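
The 0.25 bound follows from completing the square. For a logistic output $0 < x_i < 1$, 

$$ x_i (1 - x_i) = \frac{1}{4} - \Big(x_i - \frac{1}{2}\Big)^2 \le \frac{1}{4}, $$

with equality only at $x_i = 1/2$. As a rough illustration (ignoring the weights $w_{ij}$, which also scale the error terms), an error signal that has passed through $k$ logistic nodes carries a product of $k$ such factors and is therefore scaled by at most $(1/4)^k$; for $k = 3$ this is $1/64$. 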


9.4.3 MLP’s Approximation Power 

The approximation power of backpropagation MLPs has been explored by some 
researchers. Yet there is very little theoretical guidance for determining network 
size in terms of, say, the number of hidden nodes and hidden layers it should 
contain. Cybenko [8] showed that a backpropagation MLP, with one hidden layer 
and any fixed continuous sigmoidal nonlinear function, can approximate any 
continuous function arbitrarily well on a compact set. When used as a binary-valued 
neural network with the hard-limiter (step) activation function, a backpropagation 
MLP with two hidden layers can form arbitrarily complex decision regions to 
separate different classes, as Lippmann [25] pointed out. For function approximation as 
well as data classification, two hidden layers may be required to learn a 
piecewise-continuous function [29]. In their book, Hertz et al. [13] introduced an intuitive 
explanation that MLPs with two hidden layers may be able to construct localized 
receptive fields out of logistic functions. Thus, such MLPs may have abilities 
comparable to radial basis function networks, which are discussed next. 


9.5 RADIAL BASIS FUNCTION NETWORKS 
9.5.1 Architectures and Learning Methods 

Locally tuned and overlapping receptive fields are well-known structures that have 
been studied in regions of the cerebral cortex, the visual cortex, and others. Drawing 
on knowledge of biological receptive fields, Moody and Darken [34, 35] proposed a 
network structure that employs local receptive fields to perform function mappings. 
Similar schemes have been proposed by Powell [44], Broomhead and Lowe [4], and 
many others in the areas of interpolation and approximation theory; these 
schemes are collectively called radial basis function approximations. Here we shall 
call the network structure the radial basis function network or RBFN. 

Figure 9.10(a) shows a schematic diagram of an RBFN with four receptive field 
units; the activation level of the ith receptive field unit (or hidden unit) is 

$$ w_i = R_i(\mathbf{x}) = R_i(\|\mathbf{x} - \mathbf{u}_i\| / \sigma_i), \qquad i = 1, 2, \ldots, H, \tag{9.13} $$

where $\mathbf{x}$ is a multidimensional input vector, $\mathbf{u}_i$ is a vector with the same dimension 
as $\mathbf{x}$, $H$ is the number of radial basis functions (or, equivalently, receptive field 
units), and $R_i(\cdot)$ is the $i$th radial basis function with a single maximum at the 
origin. There are no connection weights between the input layer and the hidden 
layer. Typically, $R_i(\cdot)$ is a Gaussian function 

$$ R_i(\mathbf{x}) = \exp\left[ -\frac{\|\mathbf{x} - \mathbf{u}_i\|^2}{2 \sigma_i^2} \right] \tag{9.14} $$

or a logistic function 

$$ R_i(\mathbf{x}) = \frac{1}{1 + \exp\left[ \|\mathbf{x} - \mathbf{u}_i\|^2 / \sigma_i^2 \right]}. \tag{9.15} $$

Thus, the activation level of the radial basis function computed by the $i$th hidden 
unit is maximum when the input vector $\mathbf{x}$ is at the center $\mathbf{u}_i$ of that unit. 

The output of an RBFN can be computed in two ways. In the simpler method, 
as shown in Figure 9.10(a), the final output is the weighted sum of the output 
value associated with each receptive field: 


$$ d(\mathbf{x}) = \sum_{i=1}^{H} c_i w_i = \sum_{i=1}^{H} c_i R_i(\mathbf{x}), \tag{9.16} $$

where $c_i$ is the output value associated with the $i$th receptive field. We can also view 
$c_i$ as the connection weight between the receptive field $i$ and the output unit. A 
more complicated method for calculating the overall output is to take the weighted 
average of the output associated with each receptive field: 

$$ d(\mathbf{x}) = \frac{\sum_{i=1}^{H} c_i R_i(\mathbf{x})}{\sum_{i=1}^{H} R_i(\mathbf{x})}. \tag{9.17} $$


Weighted average has a higher degree of computational complexity, but it is advan- 
tageous in that points in the areas of overlap between two or more receptive fields 
will have a well-interpolated overall output between the outputs of the overlapping 
receptive fields. An example is presented in Section 9.5.4. 

For representation purposes, if we change the radial basis function $R_i(\mathbf{x})$ in each 
node of layer 2 in Figure 9.10(a) to its normalized counterpart $R_i(\mathbf{x}) / \sum_i R_i(\mathbf{x})$, 
then the overall output is specified by Equation (9.17). A more explicit representation 
is shown in Figure 9.10(b), where the division of the weighted sum ($\sum c_i w_i$) 
by the activation total ($\sum w_i$) is indicated in the division node in the last layer. 
In Figure 9.10, plots (c) and (d) are the two-output counterparts of the RBFNs in 
(a) and (b). 
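
The difference between the weighted sum of Equation (9.16) and the weighted average of Equation (9.17) is easy to see numerically. The sketch below uses Gaussian receptive fields as in Equation (9.14); the function names and the two-unit, one-input network (centers, widths, and output values) are hypothetical.

```python
import math


def gaussian_rbf(x, u, sigma):
    """Equation (9.14): exp(-||x - u||^2 / (2 sigma^2))."""
    dist2 = sum((xi - ui) ** 2 for xi, ui in zip(x, u))
    return math.exp(-dist2 / (2.0 * sigma ** 2))


def rbfn_weighted_sum(x, centers, sigmas, c):
    """Equation (9.16): d(x) = sum_i c_i R_i(x)."""
    return sum(ci * gaussian_rbf(x, u, s)
               for ci, u, s in zip(c, centers, sigmas))


def rbfn_weighted_average(x, centers, sigmas, c):
    """Equation (9.17): d(x) = sum_i c_i R_i(x) / sum_i R_i(x)."""
    w = [gaussian_rbf(x, u, s) for u, s in zip(centers, sigmas)]
    return sum(ci * wi for ci, wi in zip(c, w)) / sum(w)


# Hypothetical one-input RBFN with two receptive fields.
centers = [[0.0], [2.0]]
sigmas = [0.5, 0.5]
c = [1.0, 3.0]

midpoint = [1.0]      # equally far from both centers
d_sum = rbfn_weighted_sum(midpoint, centers, sigmas, c)
d_avg = rbfn_weighted_average(midpoint, centers, sigmas, c)
```

At the midpoint both units are only weakly active, so the weighted sum is small (about 0.54), whereas the weighted average returns 2, halfway between the two output values.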

Moody-Darken’s RBFN may be extended by assigning a linear function to the 
output function of each receptive field; that is, by making $c_i$ a linear combination of 
the input variables plus a constant: 

$$ c_i = \mathbf{a}_i^T \mathbf{x} + b_i, \tag{9.18} $$

where $\mathbf{a}_i$ is a parameter vector and $b_i$ is a scalar parameter. Stokbro et al. [57] 
used this structure to model the Mackey-Glass chaotic time series [27] and found 
that this extended version performed better than the original RBFN with the same 
number of fitting parameters. 

Figure 9.10. Four RBFNs that possess four basis functions: (a) single-output 
RBFN that uses weighted sum; (b) single-output RBFN that uses weighted average; 
(c) two-output RBFN that uses weighted sum; (d) two-output RBFN that uses 
weighted average. The network in (d) is equivalent to Figure 13.3 (upper right). 
[Note that in (b) and (d), four connections to the lower summation unit are omitted 
for simplicity.] 

An RBFN’s approximation capacity may be further improved with supervised 
adjustments of the center and shape of the receptive field (or radial basis) 
functions [24, 60]. Several learning algorithms have been proposed to identify the 
parameters ($\mathbf{u}_i$, $\sigma_i$, and $c_i$) of an RBFN. Besides using a supervised learning scheme 
alone to update all modifiable parameters, a variety of sequential training 
algorithms for RBFNs have been reported. The receptive field functions are first fixed, 
and then the weights of the output layer are adjusted. Several schemes have been 
proposed to determine the center positions ($\mathbf{u}_i$) of the receptive field functions. 
Lowe [26] proposed a way to determine the centers based on standard deviations 
of training data. Moody and Darken [34, 35] selected the centers $\mathbf{u}_i$ by means of 
data clustering techniques (see Chapter 15) that assume that similar input vectors 
produce similar outputs; the $\sigma_i$’s are then obtained heuristically by taking the average 
distance to the several nearest neighbors of the $\mathbf{u}_i$’s. In another variation, Nowlan [38] 
employed the so-called soft competition among Gaussian hidden units to locate the 
centers. This soft competition method is based on a “maximum likelihood estimate” 
for the centers, in contrast to the so-called hard competitions such as the $k$-means 
winner-take-all algorithm. 

Once these nonlinear parameters are fixed and the receptive fields are frozen, the 
linear parameters (i.e., the weights of the output layer) can be updated using either 
the least-squares method or the gradient method. Alternatively, we can apply the 
pseudoinverse method in solving Equation (9.23) to determine these weights [5]. 
Chen et al. [6] used another method that employs the orthogonal least-squares 
algorithm to determine the $\mathbf{u}_i$’s and $c_i$’s while keeping the $\sigma_i$’s at predetermined 
values. Many other aspects and schemes have been studied as well, such as 
generalization properties [2] and sequential adaptation [22], among others [19, 36]. 

9.5.2 Functional Equivalence to FIS 

An extension of the originally proposed Moody-Darken RBFN is to assign a linear 
function as the output function of each receptive field; that is, $c_i$ is a linear function 
of the input variables instead of a constant: 

$$ c_i = \mathbf{a}_i \cdot \mathbf{x} + b_i, \tag{9.19} $$

where $\mathbf{a}_i$ is a parameter vector and $b_i$ is a scalar parameter. Stokbro et al. [57] 
used this structure to model the Mackey-Glass chaotic time series [27] and found 
that this extended version performed better than the originally proposed RBFN 
with the same number of fitting parameters. Using Equation (9.19), the extended 
RBFN response given by Equation (9.16) or Equation (9.17) is identical to the 
response produced by the first-order Sugeno fuzzy inference system (FIS) discussed 
in Chapter 4, provided that the membership functions, the radial basis functions, 
and certain operators are chosen correctly. 

While the RBFN consists of radial basis functions, the FIS comprises a certain 
number of membership functions. With those radially shaped functions, both FIS 
and RBFN have a mechanism whereby they can produce a center-weighted response 
to small receptive fields, localizing the primary input excitation. Although the FIS 
and the RBFN were developed on different bases, they are essentially rooted in 
the same soil. Just as the RBFN enjoys quick convergence, the FIS can evolve to 
recognize some feature in a training data set quickly compared with simple back- 
propagation MLPs (see Chapter 12). 
The conditions under which an RBFN and a FIS are functionally equivalent are 
summarized as follows [18]. 

• Both the RBFN and the FIS under consideration use the same aggregation 
method (namely, either weighted average or weighted sum) to derive their 
overall outputs. 

• The number of receptive field units in the RBFN is equal to the number of 
fuzzy if-then rules in the FIS. 

• Each radial basis function of the RBFN is equal to a multidimensional com- 
posite MF of the premise part of a fuzzy rule in the FIS. One way to achieve 
this is to use Gaussian MFs with the same variance in a fuzzy rule, and apply 
product to calculate the firing strength. The multiplication of these Gaussian 
MFs becomes a multidimensional Gaussian function — a radial basis function 
in RBFN. (See Figure 13.3 for more details.) 

• Corresponding radial basis function and fuzzy rule should have the same re- 
sponse function. That is, they should have the same constant terms (for 
the original RBFN and zero-order Sugeno FIS) or linear equations (for the 
extended RBFN and first-order Sugeno FIS). 

The functional equivalence between FIS and RBFN cross-fertilizes both com- 
puting paradigms. Further details are presented in Section 12.4. 
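
The equivalence conditions listed above can be checked numerically. The sketch
below compares a zero-order Sugeno FIS (Gaussian MFs sharing one width per rule,
product for the firing strength, weighted average for the output) with an RBFN
that uses the same centers, width, and constants. All names and parameter values
are hypothetical and chosen only for illustration.

```python
import math

def gaussian_mf(x, center, sigma):
    """One-dimensional Gaussian membership function."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def firing_strength(x, centers, sigma):
    """Product of the per-input Gaussian MFs in the premise of one rule."""
    strength = 1.0
    for xk, ck in zip(x, centers):
        strength *= gaussian_mf(xk, ck, sigma)
    return strength

def radial_basis(x, center, sigma):
    """Multidimensional Gaussian radial basis function."""
    dist_sq = sum((xk - ck) ** 2 for xk, ck in zip(x, center))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))

def sugeno_zero_order(x, rule_centers, sigma, consequents):
    """Zero-order Sugeno FIS output with weighted-average aggregation."""
    w = [firing_strength(x, c, sigma) for c in rule_centers]
    return sum(wi * ci for wi, ci in zip(w, consequents)) / sum(w)

def rbfn_weighted_average(x, centers, sigma, weights):
    """RBFN output with weighted-average aggregation."""
    w = [radial_basis(x, c, sigma) for c in centers]
    return sum(wi * ci for wi, ci in zip(w, weights)) / sum(w)

centers = [[0.0, 0.0], [1.0, -1.0]]   # one center per rule / receptive field
constants = [2.0, -3.0]               # constant consequents / output weights
point = [0.3, -1.2]
fis_out = sugeno_zero_order(point, centers, 0.8, constants)
rbfn_out = rbfn_weighted_average(point, centers, 0.8, constants)
```

The two outputs agree up to rounding error because the product of the
one-dimensional Gaussians equals the multidimensional Gaussian.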


9.5.3 Interpolation and Approximation RBFNs 

Assuming that there is no noise in the training data set, we need to estimate a 
function $d(\cdot)$ that yields exact desired outputs for all training data. This task is 
usually called an “interpolation” problem, and the resultant function $d(\cdot)$ should 
pass through all of the training data points. When we use an RBFN with the 
same number of basis functions as we have training patterns, we have a so-called 
interpolation RBFN, where each neuron in the hidden layer responds to one 
particular training input pattern. 

Consider a Gaussian basis function centered at $\mathbf{u}_i$ with a width parameter $\sigma_i$: 

$w_i = R_i(\|\mathbf{x} - \mathbf{u}_i\|) = \exp\left[-\frac{(\mathbf{x} - \mathbf{u}_i)^2}{2\sigma_i^2}\right]$ (9.20) 


Each training input $\mathbf{x}_i$ serves as a center for the basis function $R_i$. Thus, from 
Equation (9.16), we have a Gaussian interpolation RBFN: 

$d(\mathbf{x}) = \sum_{i=1}^{n} c_i \exp\left[-\frac{(\mathbf{x} - \mathbf{x}_i)^2}{2\sigma_i^2}\right]$ (9.21) 


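For given $\sigma_i$, $i = 1, \ldots, n$, we obtain the following $n$ simultaneous linear equations 
for the $n$ unknown weight coefficients $c_i$: 

$$
\begin{aligned}
d_1 &= c_1 \exp\left[-\frac{\|\mathbf{x}_1 - \mathbf{x}_1\|^2}{2\sigma_1^2}\right] + \cdots + c_n \exp\left[-\frac{\|\mathbf{x}_1 - \mathbf{x}_n\|^2}{2\sigma_n^2}\right] \\
d_2 &= c_1 \exp\left[-\frac{\|\mathbf{x}_2 - \mathbf{x}_1\|^2}{2\sigma_1^2}\right] + \cdots + c_n \exp\left[-\frac{\|\mathbf{x}_2 - \mathbf{x}_n\|^2}{2\sigma_n^2}\right] \\
&\;\;\vdots \\
d_n &= c_1 \exp\left[-\frac{\|\mathbf{x}_n - \mathbf{x}_1\|^2}{2\sigma_1^2}\right] + \cdots + c_n \exp\left[-\frac{\|\mathbf{x}_n - \mathbf{x}_n\|^2}{2\sigma_n^2}\right]
\end{aligned}
$$

Writing them in matrix form, we obtain 

$$
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{bmatrix}
=
\begin{bmatrix}
\exp\left[-\frac{\|\mathbf{x}_1 - \mathbf{x}_1\|^2}{2\sigma_1^2}\right] & \cdots & \exp\left[-\frac{\|\mathbf{x}_1 - \mathbf{x}_n\|^2}{2\sigma_n^2}\right] \\
\exp\left[-\frac{\|\mathbf{x}_2 - \mathbf{x}_1\|^2}{2\sigma_1^2}\right] & \cdots & \exp\left[-\frac{\|\mathbf{x}_2 - \mathbf{x}_n\|^2}{2\sigma_n^2}\right] \\
\vdots & & \vdots \\
\exp\left[-\frac{\|\mathbf{x}_n - \mathbf{x}_1\|^2}{2\sigma_1^2}\right] & \cdots & \exp\left[-\frac{\|\mathbf{x}_n - \mathbf{x}_n\|^2}{2\sigma_n^2}\right]
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
\qquad (9.22)
$$

Rewriting the preceding in a compact form, we have 

$D = GC$, (9.23) 

where $D = [d_1, d_2, \ldots, d_n]^T$, $C = [c_1, c_2, \ldots, c_n]^T$, and $G$ is the $n \times n$ matrix 
in Equation (9.22). When the matrix $G$ is nonsingular, we have a unique solution: 

$C = G^{-1}D$, (9.24) 
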


where $G^{-1}$ denotes the inverse matrix of $G$. In practice, however, $G$ may be 
ill-conditioned (close to singularity), especially when the training set is large. The 
regularization approach [12, 41] allows for such cases by modifying $G$ to $G + \lambda I$, 
where $\lambda$ is a positive real number (regularization parameter) and $I$ is the identity 
matrix. (Similar modifications can be found in the Levenberg-Marquardt method to 
make the Hessian matrix positive definite; see Chapter 6 or refer to refs. [28, 30, 50].) 
Poggio and Girosi [41] took this approach in implementing a regularization network. 
Specht [56] presents a general regression neural network (GRNN) using a (nonlin- 
early) weighted average of the given training samples; the GRNN can be described 
by a topology identical to the RBFN with weighted average in Figure 9.10(b). 
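
The interpolation procedure above can be sketched in a few lines of Python. The
code builds the matrix $G$ for scalar inputs with a common width, optionally adds
$\lambda I$, and solves for the weights by Gaussian elimination. The helper names, the
common width, and the five sample points are hypothetical; they are not the data
used for the figures in this chapter.

```python
import math

def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def fit_interpolation_rbfn(xs, ds, sigma, lam=0.0):
    """Weights C from (G + lam I) C = D; each training input is a center."""
    n = len(xs)
    g = [[math.exp(-(xs[i] - xs[j]) ** 2 / (2.0 * sigma ** 2)) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        g[i][i] += lam  # regularization term; lam = 0 gives exact interpolation
    return solve_linear(g, ds)

def rbfn_output(x, xs, c, sigma):
    """Weighted-sum output of the Gaussian interpolation RBFN."""
    return sum(ci * math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2))
               for ci, xi in zip(c, xs))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]       # hypothetical training inputs
ds = [0.0, 0.8, 0.9, 0.1, -0.8]      # hypothetical desired outputs
c_exact = fit_interpolation_rbfn(xs, ds, sigma=1.0)
c_regularized = fit_interpolation_rbfn(xs, ds, sigma=1.0, lam=0.1)
```

With lam = 0 the fitted curve passes through every training point; with a
positive lam the fit is no longer exact, which is the price paid for a
better-conditioned system.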

When there are fewer basis functions than there are available training samples, 
an initial guess is required to determine their center positions. We then have an 
approximation RBFN. In this case, matrix G is not square and the least-squares 
methods described in Chapter 5 are commonly used to find the matrix C in 
Equation (9.23). Note that the matrix G may be ill-conditioned and limited precision 
may be encountered even when the pseudoinverse can be approximated by singular 
value decomposition [45]. 

Thorough discussions of both interpolation and approximation nets based on 
Poggio’s work [41] can be found in ref. [66]. 
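
As a small illustration of an approximation RBFN, the sketch below fits two
Gaussian basis functions to five data points by solving the normal equations
$(G^T G)C = G^T D$ in closed form. The restriction to two centers, the function
names, and the data are assumptions made to keep the example short; for larger
problems the least-squares methods of Chapter 5 would be used instead.

```python
import math

def design_matrix(xs, centers, sigma):
    """Non-square G: one row per training input, one column per center."""
    return [[math.exp(-(x - u) ** 2 / (2.0 * sigma ** 2)) for u in centers]
            for x in xs]

def least_squares_two_centers(g, d):
    """Solve (G^T G) C = G^T D for exactly two basis functions (Cramer's rule)."""
    s11 = sum(r[0] * r[0] for r in g)
    s12 = sum(r[0] * r[1] for r in g)
    s22 = sum(r[1] * r[1] for r in g)
    t1 = sum(r[0] * di for r, di in zip(g, d))
    t2 = sum(r[1] * di for r, di in zip(g, d))
    det = s11 * s22 - s12 * s12
    return [(s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det]

xs = [0.0, 1.0, 2.0, 3.0, 4.0]          # hypothetical training inputs
centers = [1.0, 3.0]                    # initial guess for the center positions
g = design_matrix(xs, centers, sigma=1.0)
true_c = [2.0, -1.0]                    # weights used to generate the targets
ds = [true_c[0] * row[0] + true_c[1] * row[1] for row in g]
c_hat = least_squares_two_centers(g, ds)
```

Because the targets here are generated by the same two-center model, the
least-squares solution recovers the generating weights up to rounding error.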

9.5.4 Examples 

RBFN Fitting 

Figures 9.11 and 9.12 provide the results of a small interpolation problem with five 
data points. We tested two interpolation RBFNs: the RBFN with Gaussian basis 
functions in Equation (9.20), and an RBFN with exponential basis functions: 

$w_i = R_i(\|\mathbf{x} - \mathbf{x}_i\|) = \exp(-\sigma\|\mathbf{x} - \mathbf{x}_i\|)$, (9.25) 

where we set $\sigma$ to 1.0 for both the exponential and the Gaussian basis functions. 
The presented results were obtained by solving the matrix Equation (9.24), so 
the resultant interpolation curves pass through all training data points. In Fig- 
ures 9.11 and 9.12, the results based on Equations (9.16) (weighted sum) and 
(9.17) (weighted average) are displayed in tandem with the five basis functions; 
when the basis functions do not have enough overlap, the weighted sum of the hid- 
den outputs based on Equation (9.16) may generate curves that are not smooth 
enough. (Notice that the first two basis functions do not have enough overlap.) 
Output normalization (or weighted average) surely helps the normalized basis func- 
tions cover the input space, and therefore may lead to smoother curves. Notice 
that we cannot determine which curve provides the best values for intermediate 
points because we have no certain knowledge of any points other than the five given 
data points. However, in terms of smoothness, weighted average does provide better 
performance than weighted sum. 
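
The effect of insufficient overlap can be illustrated with a short sketch. It
evaluates a network with two widely separated centers halfway between them, once
with the weighted sum and once with the weighted average. The exponential basis
function is assumed to have the form exp(-sigma |x - x_i|), and the centers and
weights are hypothetical values rather than the solved weights behind the
figures.

```python
import math

def gaussian(x, center, sigma=1.0):
    """Gaussian basis function."""
    return math.exp(-(x - center) ** 2 / (2.0 * sigma ** 2))

def exponential(x, center, sigma=1.0):
    """Exponential basis function, assumed form exp(-sigma |x - center|)."""
    return math.exp(-sigma * abs(x - center))

def rbfn(x, centers, weights, basis, normalize):
    """Weighted sum (normalize=False) or weighted average (normalize=True)."""
    w = [basis(x, u) for u in centers]
    total = sum(wi * ci for wi, ci in zip(w, weights))
    return total / sum(w) if normalize else total

centers = [0.0, 5.0]   # far apart relative to sigma = 1: little overlap
weights = [1.0, 1.0]
mid_sum = rbfn(2.5, centers, weights, gaussian, normalize=False)
mid_avg = rbfn(2.5, centers, weights, gaussian, normalize=True)
```

Halfway between the centers the weighted sum drops to about 0.088, whereas the
weighted average stays at 1, the common value of the two weights. This is the
kind of dip that makes weighted-sum curves less smooth when overlap is poor.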

Polynomial Fitting 

For comparison purposes, Figure 9.13 shows the results obtained by the following 
four methods: linear interpolation, cubic spline interpolation, fourth-order poly- 
nomial interpolation, and third-order polynomial approximation. All the methods 
except for the first one can generate smooth curves. However, it is cumbersome to 
extend spline and polynomial interpolation to high-dimensional data sets that can 
be handled by the interpolation RBFN easily. 
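
For reference, two of the four methods can be sketched compactly for
one-dimensional data: linear interpolation and polynomial interpolation in
Lagrange form (a fourth-order polynomial when five points are given). The
function names and the sample data are hypothetical.

```python
def lagrange_interpolate(x, xs, ys):
    """Evaluate the polynomial of degree len(xs) - 1 through all the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def linear_interpolate(x, xs, ys):
    """Piecewise-linear interpolation; xs must be sorted in increasing order."""
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the data range")

xs = [0.0, 1.0, 2.0, 3.0, 4.0]      # five hypothetical sample points
ys = [v * v for v in xs]            # samples of y = x^2
poly_value = lagrange_interpolate(2.5, xs, ys)
line_value = linear_interpolate(2.5, xs, ys)
```

Since the samples come from a quadratic, the interpolating polynomial
reproduces it and returns 6.25 at x = 2.5, while linear interpolation returns
6.5, the midpoint of the neighboring samples.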

Backpropagation MLP Fitting 

Figure 9.14 presents the results obtained by a backpropagation MLP. In the MLP 
results, six choices for the number of hidden units are considered to see how the 
overall responses are affected. (With respect to bias and variance in connection 
with the number of hidden units, refer to ref. [11]; it has been reported that an NN 
with few hidden neurons may exhibit high bias whereas an NN with many hidden 
neurons may have high variance. We did not examine bias and variance in this 
simulation.) The curves generated by the MLP are quite smooth and they all pass 
through the five given data points. However, the positions of these curves between 
data points are strongly influenced by the initial weights of the MLP. Moreover, as 
more hidden nodes are used, the variations of curves between data points are more 
pronounced. 

Figure 9.11. Interpolation results obtained by a Gaussian interpolation network. 
(Panels: interpolation with weighted sum; interpolation with weighted average; 
normalized Gaussian MFs.) 

Figure 9.12. Interpolation results obtained by an interpolation network with the 
exponential basis functions defined in Equation (9.25). (Panels: interpolation with 
weighted sum; exponential MFs; interpolation with weighted average; normalized 
exponential MFs.) 

Figure 9.13. Interpolation and approximation results obtained by four polynomial 
interpolation/approximation methods. (Panels: linear interpolation; cubic spline 
interpolation; fourth-order polynomial interpolation; third-order polynomial 
approximation.) 

Furthermore, to show how the MLP evolution proceeds to fit the five data points, 
we illustrate the MLP outputs at eight distinct error levels (or learning stages) from 
(a) to (h) in Figure 9.15; highlighted lines show the output responses of an MLP with 
two hidden units. The MLP was trained by the simple steepest-descent learning 
scheme. By observing the lines, we see that at least two data points are always 
passed through after the learning stage (c). That is, the MLP almost finished 
learning two of the five patterns at error level (c); the MLP then tried to learn the 
rest. This learning habit also depends on the initial weights. Therefore, it is usually 
recommended to test the MLP several times by using different sets of initial weights. 

9.6 MODULAR NETWORKS 

This section presents a particular class of modular networks, which have a hi- 
erarchical organization comprising multiple neural networks; the architecture ba- 
sically consists of two principal components: local experts and an integrating 
unit (or expert networks and a gating network, if they are expressed in the 
network form), as illustrated in Figure 9.16. A variety of modular connectionist 
architectures has been discussed, and thus such diverse names as committees of 
networks, adaptive mixtures, and hierarchical mixtures of experts have all 
been mentioned. (For simplicity, we shall call a collection of such variations modular 
networks.) 

Figure 9.14. Interpolation results obtained by simple backpropagation MLPs. 
“# HU” denotes “number of hidden units.” (Panel title: Simple Backpropagation 
MLP Approach.) 


In general, the basic concept resides in the idea that combined (or averaged) 
estimators may be able to exceed the limitation of a single estimator. Clemen has 
shown that the principle of combining a certain number of estimators has a long 
history and he cited more than 200 papers in his review [7]. The idea also shares 
conceptual links with the divide-and-conquer methodology. Divide-and-conquer 
algorithms attack a complex problem by dividing it into simpler problems whose 
solutions can be combined to yield a solution to the complex problem [20]. In other 
words, the central idea is task decomposition. When using a modular network, 
a given task is split up among several local expert NNs. The average load on each 
NN is reduced in comparison with a single NN that must learn the entire original 
task, and thus the combined model may be able to surpass the limitation of a single 
NN. The outputs of a certain number of local experts ($O_i$) are mediated by an 
integrating unit. The integrating unit puts those outputs together using estimated 
combination weights ($g_i$). The overall output $Y$ of the modular network is given by 

$Y = \sum_{i=1}^{K} g_i O_i$. 

The task decomposition idea can also be found in FIS because the outputs are 
mediated by fuzzy membership functions in a similar way. Thus, the Sugeno-type FIS 
discussed in Chapter 4 can be viewed as a variation of the modular network 
illustrated in Figure 9.16 if each local expert is expressed in the linear function defined 
in Equation (9.19), and the integrating unit is replaced with a fuzzy membership 
value generator. (See Figure 13.6 for more details.) 

Figure 9.15. Backpropagation MLP approximations of five sample data at the eight 
distinct learning stages: (a) to (h). Highlighted lines show the output responses of 
an MLP based on the simple backpropagation learning scheme. 

Nowlan, Jacobs, Hinton, and Jordan [15, 17, 38, 39] described modular networks 
from a competitive mixture perspective, which is surveyed in ref. [12]. That is, in the 
gating network, they used the softmax activation function, which was introduced 
by McCullagh and Nelder [31] and Bridle [3]. More precisely, the gating network 
uses a softmax activation $g_i$ of the $i$th output unit given by 

$g_i = \frac{\exp(k u_i)}{\sum_j \exp(k u_j)}$, (9.26) 

where $u_i$ is the weighted sum of the inputs flowing to the $i$th output neuron of 
the gating network. It is illustrated in Figure 9.17. Use of the softmax activation 
function in modular networks provides a sort of “competitive” mixing perspective 
because the $i$th local expert’s output $O_i$ with a minor activation $u_i$ does not have 
a great impact on the overall output $Y$ due to Equation (9.26). A feature of a 
modular network approach, for example, has been discussed from the “competitive” 
standpoint [12, 16], by using a trivial task example that requires a fit to the pointed 
corner of a discontinuous (piecewise-linear) function $g(x)$: 

$g(x) = \begin{cases} x, & \text{if } x \ge 0, \\ -x, & \text{if } x < 0. \end{cases}$ 


It is claimed that it would be preferable to 

• split that function into two separate pieces, 

• use two local expert NNs to learn each piece separately in the modular con- 
struction, and 

• combine the two experts’ outputs by using the values given by softmax func- 
tions. 


This claim conforms with the “competitive” mixing idea. In contrast, fuzzy 
membership functions in FIS attempt to split the task into pieces more softly than the 
softmax functions. That is, FIS stands on a sort of “complementary” mixture 
viewpoint due to the weighted average of fuzzy membership functions’ outputs. As we 
will see in Section 13.5 of Chapter 13, adaptive FIS (that is, ANFIS and CANFIS 
described in Chapters 12 and 13) learn to fit such a pointed corner in different ways 
from modular networks with softmax functions. The results shown in Section 13.5 
also suggest that RBFN with linear functions of the input variables discussed in 
Section 9.5.2 may be able to fit discontinuous functions well. 

Figure 9.17. The gating network. The dotted rectangle shows the softmax 
activation function defined in Equation (9.26). 
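
The competitive mixing discussed in this section can be sketched as follows. Two
local experts, one per linear piece, are combined by softmax gating values. The
gating weights and the scaling constant k are hand-picked, hypothetical values
rather than learned ones, and the function names are ours.

```python
import math

def softmax(u, k=1.0):
    """Softmax activations g_i = exp(k u_i) / sum_j exp(k u_j)."""
    shift = max(k * ui for ui in u)  # subtract the maximum for numerical stability
    e = [math.exp(k * ui - shift) for ui in u]
    s = sum(e)
    return [ei / s for ei in e]

def modular_output(x, gate_weights, experts, k=1.0):
    """Overall output Y = sum_i g_i O_i for a scalar input x."""
    u = [wi * x for wi in gate_weights]       # weighted sums feeding the gate
    g = softmax(u, k)
    outputs = [expert(x) for expert in experts]
    return sum(gi * oi for gi, oi in zip(g, outputs))

experts = [lambda x: x, lambda x: -x]   # one local expert per linear piece
gate_weights = [1.0, -1.0]              # hypothetical, hand-picked gating weights

y_pos = modular_output(2.0, gate_weights, experts, k=50.0)
y_neg = modular_output(-2.0, gate_weights, experts, k=50.0)
y_zero = modular_output(0.0, gate_weights, experts, k=50.0)
```

With a large k the gate is nearly winner-take-all away from the corner, so the
combined output follows x for positive inputs and -x for negative inputs; a
smaller k gives a softer blend near x = 0.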

Jordan and Jacobs [20] and Jordan and Xu [21] discussed an application of the 
expectation-maximization (EM) algorithm for maximum likelihood estimation [1] 
to train their modular networks. 

9.7 SUMMARY 

This chapter discusses supervised learning neural networks, including percep- 
trons, Adalines, backpropagation multilayer perceptrons, radial basis function net- 
works, and modular neural networks. These networks employ optimization tech- 
niques (or learning rules) to fine-tune their parameters to match a given data set 
produced by a target system to be modeled. Since the data set always contains 
desired outputs to be reproduced by the networks, the underlying learning rule is 
referred to as supervised, as compared to recording, reinforcement, and unsuper- 
vised learning, which are discussed in other chapters. Table 9.1 charts our voyage 
into other types of neural network learning rules. 

Table 9.1. Learning modes of artificial neural networks and relevant chapters. 

Learning mode    Characteristics of available information for learning    Relevant chapters 
-------------    -----------------------------------------------------    ----------------- 
Supervised       Instructive information on desired responses,             9 
                 explicitly specified by a teacher 
Recording        A priori design information for memory storing            11 
Reinforcement    Partial information about desired responses, or           10 
                 only “right” or “wrong,” evaluative information 
Unsupervised     No information about desired responses                    11 



Figure 9.18. (a) A classification problem, and (b) the perceptron designed to solve 
the problem. 


EXERCISES 


1. Finish designing the perceptron with step-function threshold units in 
Figure 9.18 by determining threshold values $\theta_i$ ($i = 1, 2, 3$) and six connection 
weights between the input layer and the hidden layer that enable your 
perceptron to recognize correctly whether an arbitrary point $(x, y)$ is in the shaded 
triangular area $T$ or not. (Note that the points on the three dividing boundary 
lines are considered to be outside area $T$.) The perceptron is supposed to 
produce the output $o$: 

$o = \begin{cases} 1 & \text{if } (x, y) \in T, \\ 0 & \text{otherwise.} \end{cases}$ 


2. Demonstrate that the backpropagation learning rule for an MLP in Equa- 
tion (9.7) can be derived from either Equation (8.5) or (8.12). 

3. Modify the backpropagation learning rule in Equation (9.7) to accommodate 
making the hyperbolic tangent function the activation function. 

4. The MATLAB function tanmlp.m (available via FTP, see page xxiii) is an 
implementation of the backpropagation MLP with the hyperbolic tangent function 
$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ as the network’s activation function. Note that 
tanmlp.m uses batch learning, but the code is fully vectorized and there is no 
for-loop to cycle through each training data pair. Verify that Equation (9.7) is 
correctly implemented in tanmlp.m in matrix forms. 

5. Modify tanmlp.m to get logmlp.m for the backpropagation MLP with the 
logistic function $f(x) = \frac{1}{1 + e^{-x}}$. 

6. Compare the training speed of tanmlp.m and logmlp.m for the bipolar and 
binary XOR problems, respectively. (You should run each program at least 10 
times, with each run including up to 500 epochs, to get enough error curves to 
make a fair comparison.) 

7. Set the momentum term $\alpha$ to zero and perform 10 runs (each including up 
to 500 epochs) of tanmlp.m for the bipolar XOR problem to plot the average 
RMSE (root-mean-squared error) curve. Repeat the same task, but use a 
couple of nonzero momentum terms. Does the use of momentum terms speed 
up training? What is the best value of $\alpha$ in your simulation? 

8. Explain why an MLP does not learn if the initial weights and biases are all 
zeros. 

9. Use tanmlp.m to solve a three-input bipolar XOR problem with a 3-n-1 
backpropagation MLP. The training data matrix is 

$$
\begin{bmatrix}
-1 & -1 & -1 & -1 \\
-1 & -1 & 1 & 1 \\
-1 & 1 & -1 & 1 \\
-1 & 1 & 1 & -1 \\
1 & -1 & -1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 \\
1 & 1 & 1 & 1
\end{bmatrix}
$$


where the first three columns are inputs, the last column is output, and each 
row represents a desired input-output data pair. Do some experiments to find 
the smallest n (number of hidden units) needed to solve the problem. 

10. Write a MATLAB script that uses an interpolation RBFN to solve the two-input 
bipolar XOR problem. Plot the overall input-output surface. 

11. Derive the backpropagation learning rule for a single-output RBFN, where the 
parameters include the center ($\mathbf{u}_i$), the width ($\sigma_i$), and the connection weight 
($c_i$) for each receptive field. 

12. Implement a MATLAB function rbfn.m to do backpropagation in a single- 
output RBFN. You should follow the input argument convention of tanmlp.m. 
Try not to use for-loops to cycle through the data set; use vectorized operations 
instead to speed up computation. 

Chapter 10 


Learning from Reinforcement 1 


E. Mizutani 


10.1 INTRODUCTION 

Reinforcement learning has been accepted as a fundamental paradigm for ma- 
chine learning with particular emphasis on computational aspects of learning. 

Learning from reinforcement is a trial-and-error learning scheme whereby a com- 
putational agent learns to perform an appropriate action by receiving evaluative 
feedback (called a reinforcement signal, performance score, grade, etc.) through 
interaction with the world (or environment) that includes no explicit teacher for 
any correct instruction. The learning agent (or learner) reinforces itself from lesson 
failures; accumulated fiascos will lead it to success but not to disaster. This learning 
method can be considered a simple way of adjusting behavior, and can be found 
in animals’ learning skills and coping with a physical environment. It also matches 
our common-sense ideas [7, 76]: 

If an action is followed by a satisfactory state of affairs, or an improve- 
ment in the state of affairs, then the tendency to produce that action 
is strengthened or reinforced (rewarded). Otherwise, that tendency is 
weakened or inhibited (penalized). 

There is a vast body of literature on reinforcement learning, also known as 
graded learning. It has been discussed in game playing, autonomous robot control, 
and so on. Commonly applied reinforcement learning strategies are classified into 
a few categories from the architectural standpoint; Sutton delineates four basic 
representative architectures for reinforcement learning [74]: 

• Policy-only 

• Reinforcement comparison 

• Adaptive heuristic critic 

• Q-learning. 

1 The content of this chapter is largely attributable to Professor Stuart E. Dreyfus at the 
Department of Industrial Engineering and Operations Research (IEOR), University of California 
at Berkeley. Some of the materials came from the work in his class IEOR 290N (Artificial Neural 
Networks), spring 1994. Of course, any errors and mistakes are our own. 

This chapter explores selected topics in reinforcement learning. Besides the afore- 
mentioned four learning architectures, we describe the temporal difference notion 
and dynamic programming principle, and discuss their computational learning pow- 
ers. 

10.2 FAILURE IS THE SUREST PATH TO SUCCESS 

An agent usually learns to map a representation of a state to an appropriate action 
or a probability distribution over a set of actions. This mapping is called the policy. 
The simplest reinforcement learning architecture consists solely of an adjustable 
policy, which is called a policy-only architecture [74]. In this section, we first 
show a simple form of reinforcement learning based on a policy-only architecture 
that closely resembles ordinary supervised learning. We outline it by means of 
an example in the next subsection. After that, we discuss credit assignment and 
evaluation functions in light of the example. They are fundamental and important 
elements in reinforcement learning. 

10.2.1 Jackpot Journey 

The objective of reinforcement learning is to find an optimal policy for selecting a 
series of actions by means of a reward-penalty scheme. We consider application of 
an elementary reward-penalty scheme to a jackpot journey problem that requires 
finding an optimal path to “gold” in the simple triangular-path network illustrated 
in Figure 10.1. 

We imagine that many travelers start their jackpot journey, wishing to find 
gold from the starting point A. [Hereafter we call a path intersection (point) of our 
network a vertex.] At each vertex, there is a signpost that has a box with some 
white and black stones in it. A traveler picks a stone from the signpost box and 
follows certain instructions; when a “white” stone is picked, “go diagonally upward,” 
denoted by action u. Conversely, when a “black” stone is chosen, “go diagonally 
downward,” signified by action d. 

Suppose that the journey is always started at vertex A, and gold is placed at 
the terminal vertex H. Further suppose that travelers strictly follow the signpost 
instructions. 

Each traveler’s behavior can be described as follows: At vertex A, pick one stone 
(selection) and put it on the signpost; according to the stone’s instruction, proceed 
to the next vertex (action). Repeat this selection-action procedure at the second 
and third vertices. (After the third action, the traveler will reach one of the four 
terminal vertices: G, H, I, or J.) When the jackpot is hit (success), prepare a reward 
scheme. When the gold is not found (failure), prepare a penalty scheme. Then trace 
back to the starting vertex A; at each visited vertex, apply the reward or penalty 
scheme. That is, put the placed stone back into the signpost box with an additional 
stone of the same color (reward), or take the placed stone away from the signpost 
(penalty). (When the traveler returns, the next traveler will hit the road with a bit 
more hope.) Repeat the same journey many times. Obviously, the probability of 
finding an optimal policy² will increase as more and more journeys are undertaken. 

Figure 10.1. The jackpot journey problem. 

In conformity with the general terminology, we can tabulate the relevant terms 
as follows: 


path network configuration   ↔   world or environment 
traveler                     ↔   agent or learner 
vertex (intersection)        ↔   state 
picking a stone              ↔   selecting an action 
sequence of three stones     ↔   policy² (or trajectory). 


It is easy to simulate such voyages on computers. We define the probability of 
action d, $P_{down}$, for each state as follows: 

$P_{down} = \frac{Num_{black}}{Num_{black} + Num_{white}}$, (10.1) 

where $Num_{black}$ is the number of black stones and $Num_{white}$ is the number of 
white stones. Accordingly, the probability of the other action u, $P_{up}$, can be defined 
as 

$P_{up} = \frac{Num_{white}}{Num_{black} + Num_{white}} = 1 - P_{down}$. (10.2) 

2 We use the term policy, which is usually employed for a decision-making problem under 
uncertainty [26] such as a stochastic problem, in which the agent makes a decision first and 
then is informed by the environment which action to follow. [See Equation (10.13).] Here, the 
term path or trajectory may be more appropriate because this jackpot journey is a deterministic 
problem, in which a decision determines the next state with certainty even if the decision is 
chosen randomly. Of course, policy is not very important for a deterministic problem. 

Figure 10.2. Changing probability of action d, “go diagonally downward,” at each 
non-terminal vertex (A-F) as trials progress in the jackpot journey problem. 

Figure 10.2 shows the changing $P_{down}$ at the six non-terminal vertices (A-F) 
as trials progressed. Initially, all signposts had 20 black stones and 20 white ones, 
and $P_{down} = P_{up} = 0.5$. At vertices C and E, the specified action was action d 
whereas the dictated action was action u at vertex D. High $P_{down}$ at vertex A (the 
starting point) favors action d. This must be because at the next vertex B, we have 
no chance to lose a path toward gold no matter which action we take, unlike the 
situation at the other alternative vertex C. This simulation result confirms that such 
a simple reward-penalty scheme is applicable to learning an “optimal policy.” 
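
A minimal simulation of the jackpot journey might look like the sketch below.
The vertex connections are an assumption inferred from the discussion above
(action d leads from A to B, both actions at B keep a path to gold, and gold
sits at H); they are not read from Figure 10.1. We also assume that a penalty is
skipped when it would empty a signpost box, a detail the description leaves
open.

```python
import random

# Assumed layout: each non-terminal vertex maps action 'd' (black stone) and
# action 'u' (white stone) to a successor vertex. Gold is at terminal vertex H.
NEXT = {
    "A": {"d": "B", "u": "C"},
    "B": {"d": "D", "u": "E"},
    "C": {"d": "E", "u": "F"},
    "D": {"d": "G", "u": "H"},
    "E": {"d": "H", "u": "I"},
    "F": {"d": "I", "u": "J"},
}

def p_down(box):
    """Probability of action d: black stones over all stones in the box."""
    return box["black"] / (box["black"] + box["white"])

def journey(boxes, rng):
    """One traveler's trip followed by the reward or penalty on the way back."""
    vertex, visited = "A", []
    while vertex in NEXT:
        action = "d" if rng.random() < p_down(boxes[vertex]) else "u"
        visited.append((vertex, action))
        vertex = NEXT[vertex][action]
    success = vertex == "H"
    for v, action in visited:
        color = "black" if action == "d" else "white"
        box = boxes[v]
        if success:
            box[color] += 1            # reward: one extra stone of that color
        elif box["black"] + box["white"] > 1:
            box[color] -= 1            # penalty: the picked stone is removed
    return success

def simulate(trials=2000, seed=0):
    """Run many journeys and return the final probability of d at each vertex."""
    rng = random.Random(seed)
    boxes = {v: {"black": 20, "white": 20} for v in NEXT}
    for _ in range(trials):
        journey(boxes, rng)
    return {v: p_down(boxes[v]) for v in NEXT}

probabilities = simulate()
```

Under the assumed layout, every visit to D lowers (or leaves unchanged) the
probability of action d there, and every visit to E raises it, in agreement with
the dictated actions u at D and d at E.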

Michie applied this reward-penalty scheme to a tic-tac-toe game, using a few 
hundred matchboxes with colored beads [53]; the matchboxes corresponded to pos- 
sible game states, and the colored beads specified admissible actions. He called 
this model MENACE (Matchbox Educable Naughts And Crosses Engine), showing 
that the scheme drove MENACE to learn an “optimal policy” to win the tic-tac- 
toe game. Later, Michie proposed the “Boxes” system based on a scheme similar 
to MENACE to solve the pole-balancing control problem (see Sections 10.5.1 and 
18.2). 

10.2.2 Credit Assignment Problem 

Through the jackpot journey, we have learned a simple reward-and-penalty scheme, 
which gives us a basic aspect of reinforcement learning. We notice, however, that 
this strategy is strictly success or failure driven; its adjustment scheme (i.e., 
increasing or decreasing stones) is always applied only when the final outcome be- 
comes known after the entire sequence of actions. Thus, it is close to ordinary 
supervised learning methods [Equation (10.7) in Section 10.3.1]. In other words, 
it ignores the intrinsic sequential structure of the problem to make adjustments at 
each state. The scheme seems to work well in small finite state space problems such 
as the well-known tic-tac-toe game and our jackpot journey problem; the entire 
state space can be searched rather easily within a reasonable amount of computa- 
tion time. Here the question arises: Is this goal-driven scheme really applicable to 
any game playing? Is all well that ends well? 

In playing chess, such a scheme seems impractical because of the huge number 
of possible states chess entails. The player (i.e., agent) would only receive feedback 
(or a reinforcement signal) concerning win or lose, after a fairly long sequence of 
moves. How would the learner (agent) know which moves were inappropriate after 
an unsuccessful experience? How would the learner know which moves may have 
been excellent? In chess playing, the learner needs to make better moves with no 
performance indication regarding winning during the game. The problem of re- 
warding or penalizing each state (or move) individually in such a long sequence 
toward an eventual victory or loss is called the temporal credit assignment prob- 
lem [55]. In contrast to temporal credit assignment, apportioning credit to the 
internal agent’s action structures is called the structural credit assignment prob- 
lem. In structural terms, it is necessary to determine which part and how much 
should be altered to enhance overall performance. The structural credit assignment 
problem is concerned with the development of appropriate internal representations. 
Any distributed learning model, such as a neural network (NN), usually involves 
both temporal and structural credit assignment at the same time (Section 10.5.2). 

The power of reinforcement learning actually lies in the fact that the agent 
does not have to wait until it receives feedback at the end to make adjustments. 
Sutton clearly defined the theoretical aspects of this key concept in codifying tem- 
poral difference (TD) methods, which are discussed in Section 10.3. Its precursor, 
Samuel’s checkers-playing program [67], and more recent work, such as Holland’s 
bucket brigade [34] (in Section 10.10.1), also share this fundamental key concept 
that the states in a sequence should be evaluated and adjusted according to their 
immediate or near-immediate successors, rather than according to the final out- 
come [77]. For such evaluation purposes, we can use the computational measure of 
an evaluation function. Delayed reinforcement learning deals with a tempo- 
ral sequence of input state vectors aimed at optimizing an evaluation function, as 
in many real-world problems that involve delays between action and any resultant 
reinforcement [78]. 

In comparison, immediate reinforcement is determined by the most recent 
input-output pair alone. Sutton clearly described reinforcement comparison archi- 
tectures as follows [74]: 

Reinforcement comparison architectures are effective at optimizing im- 
mediate rewards, but not at optimizing total reward in the long run. The 
problem is that actions have two kinds of consequences — they affect the 
next reward and they affect the next state, but reinforcement compari- 
son architectures only take the first of these into account. Suppose an 
action produces high immediate reward but deposits the environment in 
a state from which only low reward can be obtained? In order to op- 
timize long-term reward, these delayed effects of action must be taken 
into account. 

Both adaptive heuristic critic (AHC) and Q-learning architectures can 
take such delayed effects into account, as described later in Sections 10.5 and 10.6. 
Various algorithms for “reinforcement comparison” are discussed by Dayan [22]. 

10.2.3 Evaluation Functions 

In Section 10.2.1, changing the number of stones affects the probability of an action 
at a state. As a result, the destination signposts indicate optimal trajectories; 
MENACE uses similar methods to find winning trajectories in the tic-tac-toe game. 
Their methods correspond to search strategies for optimal trajectories in artificial 
intelligence (AI) literature. The principal goal is finding successful trajectories 
among all admissible world states. The agent searches for states that yield reward 
by performing a series of actions. Learning thus can be viewed as a search. 

Evaluation functions, also known as value functions, play an important 
role in learning [44, 60, 63, 97]. The evaluation functions produce scalar values 
(reinforcement signals) of states to aid in finding optimal trajectories. (Such eval- 
uation functions correspond to the emotions in the biological brain [3].) In the 
AI field, it is widely accepted that the performance of search algorithms is greatly 
improved by the use of evaluation functions. Manhattan Distance, for instance, is 
a well-known heuristic evaluation function for the “eight puzzle” problem in which 
we must move tiles to target positions in a 3 x 3 square frame containing eight 
numbered square tiles and a “blank” space [18, 27, 42, 60, 64]. The Manhattan 
Distance heuristic computes the sum of each tile’s vertical and horizontal distance 
from its target position to estimate the number of moves required to reach the goal. 
It can operate as a heuristic function in the well-known A* algorithm [27, 60, 97], 
whose evaluation function combines a heuristic (or cost-to-go) function and a cost 
(or cost-so-far) function [66]. Winston [97] introduced an improved version of A*
that reflects the dynamic programming principle, which we shall discuss in
Section 10.4. Korf [42] proposed a learning real-time A* (LRTA*) algorithm as an
extension of the A* algorithm. Russell and Wefald gave a thorough treatment of a
family of these algorithms [65], and Barto et al. discussed a close tie between
LRTA* and learning based on real-time dynamic programming [6].
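
The Manhattan Distance heuristic described above can be sketched in a few lines. This is only an illustrative sketch: the function name, the tuple encoding of the board (row by row, with 0 standing for the blank), and the sample configuration are our assumptions, not part of the cited works.

```python
def manhattan_distance(state, goal, width=3):
    """Sum of each tile's horizontal and vertical distance from its target.

    state and goal are tuples listing the tiles row by row; 0 marks the blank,
    which is not counted (assumed encoding, for illustration only).
    """
    goal_index = {tile: i for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue  # the blank is not a tile
        j = goal_index[tile]
        total += abs(i // width - j // width) + abs(i % width - j % width)
    return total


goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
state = (8, 1, 3, 4, 0, 2, 7, 6, 5)
h = manhattan_distance(state, goal)  # heuristic estimate of moves to the goal
```

In an A*-style search, a value such as `h` would play the role of the cost-to-go term that is added to the cost-so-far of a node.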

Manhattan Distance is not perfect, but it seems to reflect well the actual chances
of reaching a given goal despite its simplicity. Evaluation functions must not require 
heavy computations; hence there is a dilemma between the accuracy of the evalua- 
tion functions and their computation cost. Major concerns regarding the evaluation 
functions are as follows: 

• How to devise effective evaluation functions, 


• How to store evaluation functions, 


• How to adjust evaluation functions. 

Devising effective evaluation functions may require problem-specific knowledge. 
Evaluation functions can be stored as tables, symbolic rules, decision trees [21], 
CMACs [3] (Cerebellar Model Arithmetic Computers/Cerebellar Model Articula- 
tion Controller) [46, 86] or parametric approximators. Adjusting schemes of evalu- 
ation functions may depend on the function forms. 

In Section 10.2.1, the agent’s behavior was totally result-driven; that is, the
agent acted based on an evaluation-free algorithm. Unlike such a result-driven
agent, an evaluation-driven agent applies the evaluation functions at each visited
state to estimate how promising each admissible action is, and thus chooses the
most promising action without concern for the value of the final solution. In
other words, evaluation functions are indispensable for evaluating states in a
temporally successive manner; we elaborate on this in the next section.


10.3 TEMPORAL DIFFERENCE LEARNING 

Temporal difference (TD) methods are a class of incremental learning procedures 
specialized for prediction whereby credit is assigned based on the difference between 
temporally successive predictions [77]. 

In earlier studies, Samuel [67] employed a TD method in the checkers-playing 
program to improve parameterized evaluation functions (linear functions of input 
variables) through experiential tuning; it could learn on the basis of its experi- 
ence and thus improve its performance by updating parameters. More specifically, 
Samuel’s program used the evaluation difference between a board configuration and 
a likely future board configuration. It chose the most advantageous position in the 
absence of complete information at any temporal game state. The TD method was 
also employed to solve the temporal credit assignment problem in the predictor part 
of an adaptive critic [8, 5], which is discussed in Section 10.5. 

Sutton has formalized these TD methods as general methods for learning to 
predict arbitrary events (not just goal-related ones). 



[Figure 10.3: a spectrum running from ordinary supervised learning, TD(1), at one
end to classical dynamic programming, TD(0), at the other, with TD($\lambda$) in
between.]

Figure 10.3. TD learning spectrum. TD($\lambda$) moves along the TD learning
spectrum according to the value of $\lambda$.


10.3.1 TD Formulation 

In the general form of TD methods, TD($\lambda$), the modifiable parameters $w$ of the
agent’s predictor obey the following update rule with a learning rate $\alpha$:

$$\text{TD}(\lambda):\quad \Delta w_t = \alpha\,(V_{t+1} - V_t) \sum_{k=1}^{t} \lambda^{t-k}\, \nabla_w V_k, \tag{10.3}$$

where $V_t$ is the prediction value at time $t$ and $\lambda$ is a (discounting) recency
parameter ranging from 0 to 1. Adjustments to predictions occurring $k$ steps (or
stages) in the past are exponentially weighted; more recent predictions make
greater weight changes. This may match biological brain strategies for deciding
how strongly recently received stimuli should be used in combination with the
current stimuli to determine actions. This can be viewed as a usual supervised
learning procedure for the pair current prediction $V_t$ (“actual” output) and its
subsequent prediction $V_{t+1}$ (“desired” output) in the error term. In this sense,
reinforcement (TD) learning can be regarded as a form of supervised
error-correction learning, which often minimizes the squared error $E_{td}$ between
final outcome $z$ and current prediction $V_t$:

$$E_{td} = \tfrac{1}{2}\,\{z - V_t\}^2 = \tfrac{1}{2}\,\Bigl\{\sum_{k=t}^{m} (V_{k+1} - V_k)\Bigr\}^2, \tag{10.4}$$

where $V_{m+1} = z$. Figure 10.3 shows that TD($\lambda$) moves along the TD learning
spectrum according to the value of $\lambda$.
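
To make Equation (10.3) concrete, the following sketch applies it to one observed sequence. It rests on several assumptions of ours: the predictor is linear, so the gradient of $V_k$ with respect to $w$ is simply the observation vector; the weighted sum of past gradients is kept incrementally as a running trace; and the weight changes are accumulated and applied only at the end of the sequence. The function and argument names are hypothetical.

```python
def td_lambda_episode(w, observations, z, alpha=0.1, lam=0.5):
    """Return the weights after one sequence, following Equation (10.3).

    Assumes a linear predictor V_t = w . x_t, so grad_w V_t = x_t.
    The sum over past gradients is maintained as a trace:
        trace_t = lam * trace_{t-1} + x_t
    The final "prediction" V_{m+1} is the observed outcome z.
    """
    def predict(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    trace = [0.0] * len(w)
    delta = [0.0] * len(w)
    for t, x in enumerate(observations):
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        is_last = (t + 1 == len(observations))
        v_next = z if is_last else predict(observations[t + 1])
        td_error = v_next - predict(x)
        delta = [d + alpha * td_error * e for d, e in zip(delta, trace)]
    return [wi + d for wi, d in zip(w, delta)]


# Two one-hot observations followed by the outcome z = 1.
xs = [[1.0, 0.0], [0.0, 1.0]]
w_td1 = td_lambda_episode([0.2, 0.6], xs, z=1.0, alpha=0.1, lam=1.0)
w_td0 = td_lambda_episode([0.2, 0.6], xs, z=1.0, alpha=0.1, lam=0.0)
```

With `lam=1.0` the first weight is credited with the whole difference between the outcome and its prediction, whereas with `lam=0.0` it is moved only toward the next prediction.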

In the two extreme cases $\lambda = 1$ and $\lambda = 0$, we have TD(1) and TD(0),
respectively, as follows:

$$\text{TD}(1):\quad \Delta w_t = \alpha\,(V_{t+1} - V_t) \sum_{k=1}^{t} \nabla_w V_k, \tag{10.5}$$

and

$$\text{TD}(0):\quad \Delta w_t = \alpha\,(V_{t+1} - V_t)\, \nabla_w V_t. \tag{10.6}$$

Note that the convention “$0^0 = 1$” is used here. In the special case of TD(1),
as defined in Equation (10.5), all past predictions make equal contributions to
the weight alterations; this means that all states are equally weighted. Because
$\Delta w_t \propto -\nabla_w E_{td} = \nabla_w V_t\,(z - V_t)$, due to Equation (10.4),
Equation (10.5) can be rewritten as follows:

$$\Delta w_t = \alpha\,(z - V_t)\, \nabla_w V_t. \tag{10.7}$$

The prediction error is represented as the difference between final outcome $z$ and
current prediction $V_t$.

Consider the difference between Equation (10.5) and Equation (10.7).
Equation (10.7) is actually closer to an ordinary supervised learning procedure
for the pair current prediction $V_t$ (actual output) and its final consequent $z$
(target output). Using Equation (10.7), $\Delta w_t$ can be determined only after the
whole sequence of actions has been completed and the final outcome $z$ is
available; therefore, all state-prediction pairs must be remembered to make this
adjustment. In other words, Equation (10.7) cannot be computed incrementally in
multiple-step (stage) problems. Recall our previous jackpot journey problem in
Section 10.2.1; the signpost model realizes a type of learning similar to that
stipulated in Equation (10.7). Past sequences must be remembered to make stone
adjustments; that is, the traveler must remember past traces to apply the
reward-penalty scheme to each signpost, and all states receive equal
reinforcement signals (i.e., one stone). On the other hand, Equation (10.5)
offers a way to compute incrementally, which saves memory space required to
store past values.
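
The equivalence between Equations (10.5) and (10.7) holds for the total weight change over a sequence, not step by step. As a brief worked check, assume the weights are held fixed while the sequence of $m$ predictions is processed and recall that $V_{m+1} = z$. Exchanging the order of summation and telescoping the inner sum gives

$$\sum_{t=1}^{m} \alpha\,(V_{t+1} - V_t) \sum_{k=1}^{t} \nabla_w V_k
= \sum_{k=1}^{m} \alpha\, \nabla_w V_k \sum_{t=k}^{m} (V_{t+1} - V_t)
= \sum_{k=1}^{m} \alpha\,(z - V_k)\, \nabla_w V_k .$$

The left-hand side accumulates the TD(1) changes of Equation (10.5), and the right-hand side accumulates the supervised changes of Equation (10.7).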

In the other extreme case, TD(0) as defined in Equation (10.6), only the most
recent prediction affects the weight alterations. This is closer to the dynamic
programming (DP) procedure discussed in Section 10.4. Application of
Equation (10.6) to the jackpot journey problem is discussed in the next subsection.

Sutton [77] introduced a game-playing example, illustrated in Figure 10.4, to 
clarify the inefficiency of supervised learning methods in comparison with TD meth- 
ods. Figure 10.4 represents a case in which the trajectory followed from a new state 
reaches an unusual win via a bad state that has led 90% of the time to a loss and 
only 10% to a win from past experience. In this case, TD methods reasonably 
evaluate the new state by adjusting the values of the observed states in a tempo- 
rally successive manner. Supervised learning methods, on the other hand, associate 
the new state fully with the observed victory, although the new state is much less 
promising. 


10.3.2 Expected Jackpot 

We revisit the jackpot paradise discussed in Section 10.2.1; we show how TD(0) from
Equation (10.6) works using the lookup-table perceptron with linearly independent
state vectors depicted in Figure 10.5. The TD (perceptron) net has an identity
function at the output layer. In this case, we can apply the linear supervised
learning (Widrow-Hoff) rule; we have $V_t = w^T s_t$, and thus $\nabla_w V_t = s_t$.
Therefore, Equation (10.6) can be written as follows:

$$\text{Linear TD}(0):\quad \Delta w_t = \alpha\,(w^T s_{t+1} - w^T s_t)\, s_t.$$



[Figure 10.4: a game tree in which a trajectory (dotted lines) starts at a state
NEW, passes through a state BAD, and ends in WIN. The weight changes along the
trajectory are labeled $\Delta w_t \propto V_{t+1}(s) - V_t(\text{NEW})$,
$\Delta w_{t+1} \propto V_{t+2}(s) - V_{t+1}(s)$, ...,
$\Delta w_p \propto V_{p+1}(\text{BAD}) - V_p(s)$,
$\Delta w_{p+1} \propto V_{p+2}(s) - V_{p+1}(\text{BAD})$, ...,
$\Delta w_m \propto V_{m+1}(\text{WIN}) - V_m(s)$.]

Figure 10.4. An example (originally introduced by Sutton [77]) in which supervised
learning methods poorly evaluate a new state that is reached for the first time
and from which the trajectory marked by the dotted lines is then followed. TD
learning methods predict an estimate of a new state, $V_t(\text{NEW})$, by
considering temporally successive predictions, including $V_{p+1}(\text{BAD})$,
together with the observed victory (+1), whereas supervised learning methods
associate $V_t(\text{NEW})$ fully with $V_{m+1}(\text{WIN})$ (= +1).



Figure 10.5. A lookup-table perceptron to approximate expected values in the jack- 
pot journey problem. 


Note that there are six nonterminal vertices in Figure 10.5. The states (i.e.,
vertices) are expressed as linearly independent unit-basis vectors [$s(i)$] of
length 6; that is, vertex A can be defined as a vector, $(1\ 0\ 0\ 0\ 0\ 0)^T$, and
vertex B can be defined as another linearly independent vector,
$(0\ 1\ 0\ 0\ 0\ 0)^T$, and so on.

Table 10.1. Predicted probabilities for six nonterminal vertices at three distinctive
learning stages using a lookup-table perceptron with a small learning rate (0.001).
RMSE means “root-mean-squared error.”

Epoch    Vertex A  Vertex B  Vertex C  Vertex D  Vertex E  Vertex F  RMSE
…        …         …         …         …         …         …         0.05624
…        …         …         …         …         …         …         0.00590
…        …         …         …         …         …         …         0.00155
(ideal)  0.375     0.5       0.25      0.5       0.5       0.0

[The epoch labels and most predicted values are illegible in the source; the
legible entries that cannot be assigned to a column are 0.4248, 0.4922, and
0.5008, one per learning stage.]

Suppose our journey’s outcome is defined as z = 1 for the target vertex H where 
“gold” is present, and defined as z = 0 for the other terminal vertices G, I, and J, 
where no gold is present. For this choice of z, the expected value of each vertex is 
tantamount to the probability of reaching gold (vertex H) from that vertex. Sutton
discussed a random-walk example [77], in which he introduced a similar probabilistic
interpretation.

At each vertex, an agent selects either action d, “go diagonally downward,” or
action u, “go diagonally upward,” with equal probability. This probability is
not changed based on experience, as it was in the optimization problem discussed in 
Section 10.2.1. Given this particular “0.5-0.5 policy,” the agent learns the expected 
values. For the third action, the terminal reward provided by the world is used as 
the outcome value rather than the TD net’s output. (This is the realization of the 
boundary conditions.) 

The ideal probabilities of reaching gold (i.e., the desired predictions) for each of 
the nonterminal states are 0.375, 0.5, 0.25, 0.5, 0.5, and 0.0 for vertices A, B, C, 
D, E, and F, respectively. The results, obtained from the lookup-table perceptron 
with a small fixed learning rate (0.001), are shown in Table 10.1; many iterations 
were required for convergence. 
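
The experiment can be imitated with the short sketch below. The vertex-to-vertex connections are our assumption: they are not stated in this section and were chosen only because they reproduce the ideal probabilities quoted above. The learning rate, number of episodes, and function name are likewise illustrative, and the weights are updated after every step rather than at the end of a journey. Because the state vectors are unit-basis vectors, $w^T s_t$ reduces to the single weight stored for the current vertex.

```python
import random

# Assumed topology (hypothetical): each nonterminal vertex has an "up" and a
# "down" successor; H is the only terminal vertex where gold is present.
SUCCESSORS = {
    "A": ("B", "C"), "B": ("D", "E"), "C": ("E", "F"),
    "D": ("G", "H"), "E": ("H", "I"), "F": ("I", "J"),
}
OUTCOME = {"G": 0.0, "H": 1.0, "I": 0.0, "J": 0.0}


def train_td0(episodes=100_000, alpha=0.002, seed=0):
    rng = random.Random(seed)
    w = {v: 0.0 for v in SUCCESSORS}  # one weight per vertex (lookup table)
    for _ in range(episodes):
        s = "A"
        while s in SUCCESSORS:
            nxt = rng.choice(SUCCESSORS[s])  # the "0.5-0.5 policy"
            # Boundary condition: at a terminal vertex use the outcome z
            # supplied by the world instead of the net's own output.
            target = OUTCOME[nxt] if nxt in OUTCOME else w[nxt]
            w[s] += alpha * (target - w[s])  # linear TD(0) with one-hot s_t
            s = nxt
    return w


predicted = train_td0()
ideal = {"A": 0.375, "B": 0.5, "C": 0.25, "D": 0.5, "E": 0.5, "F": 0.0}
```

With a fixed learning rate the predictions keep fluctuating slightly around the ideal values, which is one reason a small rate and many iterations are needed.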

10.3.3 Predicting Cumulative Outcomes 

Consider a case in which each action in a sequence incurs a cost. For instance, in
the simple path network sketched in Figure 10.6, numbers between any two vertices
represent the cost of traveling that interval. At each vertex, we need $V_t$ to
estimate the remaining cumulative cost rather than the total cost of the sequence.
TD methods can be extended to deal with this case [77]. That is, TD methods are not
confined to predicting only the final outcome of sequences, but can also be used to
estimate quantities that accumulate over sequences.

Let $c_{t+1}$ denote the actual cost incurred between times $t$ and $t+1$. We want
$V_t$ to equal the expected value of $z_t = \sum_{k=t}^{m} c_{k+1}$, where $m$ is the
number of observation vectors in the sequence.

Figure 10.6. A simple cost path problem.


As in Equation (10.7), the prediction error can be represented as

$$\begin{aligned}
\text{(final outcome)} - \text{(current prediction)} &= z_t - V_t \\
&= \sum_{k=t}^{m} c_{k+1} - V_t \\
&= \sum_{k=t}^{m} \left(c_{k+1} + V_{k+1} - V_k\right),
\end{aligned}$$

where $V_{m+1} = 0$ (boundary conditions). We can thus derive the following update
rule for cumulative TD($\lambda$):

$$\Delta w_t = \alpha\,(c_{t+1} + V_{t+1} - V_t) \sum_{k=1}^{t} \lambda^{t-k}\, \nabla_w V_k. \tag{10.8}$$

We discuss this cost path problem in greater detail in Section 10.7. 

In infinite prediction problems, there are no goal states to terminate the sum $z_t$.
To prevent the divergence of this sum, a discounting factor $\gamma$ can be
introduced. Thus, the agent’s objective is to minimize the discounted sum of
future costs given by

$$\sum_{k=0}^{\infty} \gamma^k c_{t+k+1} = c_{t+1} + \gamma c_{t+2} + \gamma^2 c_{t+3} + \cdots \tag{10.9}$$

In this case, Equation (10.8) becomes

$$\Delta w_t = \alpha\,(c_{t+1} + \gamma V_{t+1} - V_t) \sum_{k=1}^{t} \lambda^{t-k}\, \nabla_w V_k. \tag{10.10}$$
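
The form of the error term in Equation (10.10) follows from a one-step recursion satisfied by the discounted sum in Equation (10.9): the sum from time $t$ equals $c_{t+1}$ plus $\gamma$ times the sum from time $t+1$. The sketch below evaluates that recursion backward over a finite, truncated cost sequence; the truncation, the function name, and the sample numbers are our assumptions for illustration.

```python
def discounted_costs_to_go(costs, gamma):
    """costs[t] holds c_{t+1}; returns z with z[t] = sum_k gamma**k * costs[t + k]."""
    z = [0.0] * (len(costs) + 1)  # the extra final entry is the boundary value 0
    for t in reversed(range(len(costs))):
        z[t] = costs[t] + gamma * z[t + 1]  # z_t = c_{t+1} + gamma * z_{t+1}
    return z[:-1]


z_discounted = discounted_costs_to_go([1.0, 2.0, 3.0], gamma=0.5)
z_plain = discounted_costs_to_go([1.0, 2.0, 3.0], gamma=1.0)
```

Replacing the unknown sum from time $t+1$ by the prediction $V_{t+1}$ turns the recursion into the target $c_{t+1} + \gamma V_{t+1}$ that is compared with $V_t$ in Equation (10.10).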

TD($\lambda$) has been proven to converge only for a linear network with linearly
independent state vector inputs [23, 77]. (We described such a linear network in
Section 10.3.2.) However, Tesauro constructed a nonlinear NN, with state vectors
that were not linearly independent, that used TD($\lambda$) to play the game of
backgammon; he reported the successful game application (TD-Gammon) and discussed
practical issues concerning TD learning [78, 79, 80, 81].

Bertsekas [17] presented a counterexample to TD learning, in which a representation
of the evaluation function constructed by TD($\lambda$) becomes worse as $\lambda$
changes from 1 to 0; the optimal representation was obtained when $\lambda = 1$.

10.4 THE ART OF DYNAMIC PROGRAMMING 

Dynamic programming (DP) is an optimization procedure that is particularly 
applicable to problems requiring a sequence of interrelated decisions [26]. 

In connection with DP, several researchers have discussed reinforcement learning. 
Watkins [86, 87] proposed a class of Q-learning (discussed in Section 10.6) that 
employs “incremental DP,” which we shall describe later. We must emphasize the 
following two aspects of DP with respect to reinforcement learning: 

• DP successively approximates optimal evaluation functions by solving recur- 
rence relations, instead of conducting searches in state space. 

• Backing up state evaluations is a fundamental property of iterative procedures 
used to solve the recurrence relations. 

We clarify these points in the following two subsections. 

10.4.1 Formulation of Classical Dynamic Programming 

For the purposes of later discussion, we present a brief overview of the basic art of 
DP. For more thorough treatment of classical DP, refer to refs. [11, 26]. Although 
there are many variations of DP, we focus on major concepts common to all DP 
procedures, following the definitions of a book entitled The Art and Theory of 
Dynamic Programming by Dreyfus and Law [26]. 

First, we delineate the principle of optimality [11], the backbone of DP, which 
flows from intuition: 

An optimal policy has the property that whatever the initial state and 
initial decision are, the remaining decisions must constitute an optimal 
policy with regard to the state resulting from the first decision. 

In light of the principle of optimality, the basic procedural insights of (backward)
DP are as follows [26]:

• Recognize that a given “whole problem” can be solved if the values of the best 
solutions of certain subproblems can be determined according to the principle 
of optimality. 

• Realize that if one starts at or near the end of the “whole problem,” the
subproblems are so simple as to have trivial solutions. The principal attraction 
of (backward) DP resides in such ease. 

In the art of DP, it is important to define “state space” appropriately and to choose 
the arguments of the optimal value function (rule of assigning values to various 
subproblems) so that a real process under consideration has the Markov prop- 
erty [35] and thus the principle of optimality holds. The usual DP formulation 
consists of determining the four appropriate definitions: (1) an optimal value func- 
tion, (2) an optimal policy function, (3) a recurrence relation, and (4) boundary 
conditions. 

Recall the cost path problem illustrated in Figure 10.6 in Section 10.3.3. There
we sought the expected cost of a given policy using TD methods. (With no action
choice to be made, this is not a DP problem.) Now we consider that the objective
of the learning agent is to select actions so as to drive the world to a goal state
(i.e., terminal vertex) $g \in G \subset S$ at a minimum cumulative cost, where $G$ is
a set of goal states and $S$ is the set of all states. We introduce a measure of
value [i.e., the optimal value function $V(s)$] as follows:

$V(s)$ = the value of the minimum cost of going from a state $s$ to a goal $g$.

The recurrence relation given by the principle of optimality uniquely defines these
values. That is, the minimal cost of a state $s$ must equal the cost of the best
action $a$, plus the minimal cost of the next state designated by the action $a$:

$$V(s) = \min_{a \in \text{actions}} \{\text{cost}(s,a) + V(\text{next}(s,a))\}, \quad \forall s \in S - G, \tag{10.11}$$

where next$(s,a)$ denotes the state subsequent to state $s$ dictated by action $a$, and
cost$(s,a)$ signifies the incurred cost of the transition following action $a$. In
other words, the evaluation $V(s)$ of state $s$ should be equal to the best of
“cost$(s,a)$ plus the value of the next state $t$ = next$(s,a)$ that can be reached in
one action $a$.” The values of the goal states (or terminal vertices) are defined by

$$V(s) = 0, \quad \forall s \in G. \tag{10.12}$$

Equation (10.12) shows the boundary conditions. (Note that the optimal policy
function $\pi$ is obvious from Equation (10.11); that is, $\pi(s)$ is an action $a$ that
attains the minimum in
$V(s) = \min_{a \in \text{actions}} \{\text{cost}(s,a) + V(\text{next}(s,a))\}$.)
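
A minimal sketch of backward DP for Equations (10.11) and (10.12) follows. The layered network, its vertex names, and its costs are hypothetical values chosen for illustration; they are not the numbers of Figure 10.6. The states are processed from those nearest the goals back to the start, so every value on the right-hand side of Equation (10.11) is already known when it is needed.

```python
# Hypothetical layered path network; names and costs are illustrative only.
NEXT = {
    "A": {"u": "B", "d": "C"},
    "B": {"u": "D", "d": "E"},
    "C": {"u": "E", "d": "F"},
    "D": {"u": "G", "d": "H"},
    "E": {"u": "H", "d": "I"},
    "F": {"u": "I", "d": "J"},
}
COST = {
    "A": {"u": 2, "d": 1},
    "B": {"u": 4, "d": 1},
    "C": {"u": 3, "d": 2},
    "D": {"u": 1, "d": 5},
    "E": {"u": 2, "d": 4},
    "F": {"u": 6, "d": 1},
}
GOALS = ("G", "H", "I", "J")


def backward_dp(order=("F", "E", "D", "C", "B", "A")):
    V = {g: 0 for g in GOALS}  # boundary conditions, Equation (10.12)
    policy = {}
    for s in order:  # states nearest the goals come first
        best = min(NEXT[s], key=lambda a: COST[s][a] + V[NEXT[s][a]])
        policy[s] = best
        V[s] = COST[s][best] + V[NEXT[s][best]]  # Equation (10.11)
    return V, policy


V, policy = backward_dp()
```

For these illustrative costs the cheapest route from A takes action d three times, for a total cost of 4.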

For a stochastic case of this minimum-cost path problem, we generally assume
that when action $d$ is suggested, it is followed with probability $P_{d1}$; with
probability $P_{u1}$ ($= 1 - P_{d1}$), the opposite action $u$ is tried. Likewise, when
action $u$ is instructed, it is followed with probability $P_{u2}$; with probability
$P_{d2}$ ($= 1 - P_{u2}$), action $d$ is performed. Now $V(s)$ is the optimal expected
value function:

$V(s)$ = the expected value of the minimum cost of going from a state $s$ to a goal $g$.

Then we can specifically define the recurrence relation in a general form as follows:

$$V(s) = \min \begin{cases}
P_{d1}\{\text{cost}(s,d) + V(t_d)\} + P_{u1}\{\text{cost}(s,u) + V(t_u)\} \\
P_{u2}\{\text{cost}(s,u) + V(t_u)\} + P_{d2}\{\text{cost}(s,d) + V(t_d)\}
\end{cases} \tag{10.13}$$

where $t_d$ = next$(s,d)$ and $t_u$ = next$(s,u)$.
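
One application of Equation (10.13) at a single state can be written as follows. The function and argument names are ours, and the numbers in the usage lines are illustrative.

```python
def expected_backup(cost_d, cost_u, v_d, v_u, p_d1, p_u2):
    """One application of Equation (10.13) at a state s.

    p_d1: probability that a suggested d is actually executed (P_u1 = 1 - p_d1).
    p_u2: probability that a suggested u is actually executed (P_d2 = 1 - p_u2).
    v_d, v_u: values V(t_d) and V(t_u) of the two successor states.
    """
    go_d = cost_d + v_d
    go_u = cost_u + v_u
    suggest_d = p_d1 * go_d + (1.0 - p_d1) * go_u  # first line of (10.13)
    suggest_u = p_u2 * go_u + (1.0 - p_u2) * go_d  # second line of (10.13)
    return min(suggest_d, suggest_u)


deterministic = expected_backup(2, 1, 3, 3, p_d1=1.0, p_u2=1.0)
noisy = expected_backup(2, 1, 3, 3, p_d1=0.8, p_u2=0.8)
```

When both probabilities equal 1, the expression reduces to the deterministic recurrence of Equation (10.11).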

TD methods in Section 10.3 approximate the evaluation function for a given
policy without knowing the state-transition probabilities and the function
determining expected payoff values. Barto et al. [10] mentioned that the TD learning
process is a Monte-Carlo approximation of a successive DP approximation method.
They also noted that TD methods bear a great resemblance to DP from a learning
standpoint [9]. The aforementioned backward DP procedure works from the
end of a decision task to its beginning. It seems hardly related to animal learning 
processes, due to this back-to-front processing. Yet Barto et al. showed how TD 
learning can obtain much the same result as DP by repeated forward passes through 
a decision task; the computation is incrementally accomplished by moving forward 
to goal states. This reversed viewpoint can be found in so-called forward DP; see 
also ref. [26]. 


10.4.2 Incremental Dynamic Programming 

Applying classical DP requires knowledge of state-transition probabilities, as
demonstrated in Equation (10.13). In the absence of explicit probability
information, we can approximate the value function. To improve the value function
approximation $\hat{V}(s)$, we can apply incremental DP [6, 16, 73, 75] (also known as
approximate DP [91]) given by the following functional equation$^3$:

$$\hat{V}(s) \leftarrow \min_{a \in \text{actions}} \{\text{cost}(s,a) + \hat{V}(\text{next}(s,a))\}, \quad \forall s \in S - G, \tag{10.14}$$

where for $s \in G$, $\hat{V}(s) = V(s) = 0$ (boundary conditions). Sutton [75] claimed
that iterative application of Equation (10.14) to one state after another makes
$\hat{V}$ a better and better approximation of $V$. This operation is, in essence, a
one-stage-ahead (or one-ply) search from state $s$, followed by replacement of the
approximate value $\hat{V}(s)$ with its backup value. This procedure seems convincing,
but there is one important caveat: convergence may not be monotonic. That is, the
procedure will converge to correct values if done often enough at all possible
states, but each value may not necessarily improve at each backup. If a state has
a right value and the next state has a wrong one, the backup makes the approximate
value $\hat{V}(s)$ worse; the value will then temporarily go wrong. In stochastic
problems, only the average of many (wrong) values may be right, or may approach
the correct value if the learning rate goes to zero appropriately. In this sense,
the performance may improve incrementally while the procedure is being accomplished.
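
The backup of Equation (10.14) can be tried on a small example. The network below is hypothetical (names and costs are illustrative), all approximate values start at zero, and the states are deliberately swept in the least favorable order, from the start vertex toward the goals, so that several sweeps are needed before the values settle.

```python
# Hypothetical layered path network; names and costs are illustrative only.
NEXT = {
    "A": {"u": "B", "d": "C"},
    "B": {"u": "D", "d": "E"},
    "C": {"u": "E", "d": "F"},
    "D": {"u": "G", "d": "H"},
    "E": {"u": "H", "d": "I"},
    "F": {"u": "I", "d": "J"},
}
COST = {
    "A": {"u": 2, "d": 1},
    "B": {"u": 4, "d": 1},
    "C": {"u": 3, "d": 2},
    "D": {"u": 1, "d": 5},
    "E": {"u": 2, "d": 4},
    "F": {"u": 6, "d": 1},
}
GOALS = ("G", "H", "I", "J")


def incremental_dp(sweeps, order=("A", "B", "C", "D", "E", "F")):
    V = {s: 0 for s in list(NEXT) + list(GOALS)}  # goal values stay at 0
    for _ in range(sweeps):
        for s in order:
            # One-stage-ahead search followed by a backup, Equation (10.14).
            V[s] = min(COST[s][a] + V[NEXT[s][a]] for a in NEXT[s])
    return V


history_at_A = [incremental_dp(n)["A"] for n in range(1, 5)]
```

The value at A passes through intermediate estimates before it settles, and further sweeps leave it unchanged; sweeping from the goals backward would reach the same values in a single pass.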


3 “$a \leftarrow b$” represents “set $a$ equal to $b$.” This notation is also used in Algorithms 10.1 and 10.2.

As the value function approaches the optimal one, the current policy becomes
more optimal by degrees. The optimal policy can (it is hoped) be obtained by
selecting actions that minimize the right-hand side of Equation (10.11) at each
state. That is, the learning process converges when Equation (10.11) holds for all
states [73]. Watkins [86, 87] used this key concept to extend TD methods and
proposed a class of learning called Q-learning (Section 10.6).

When the agent’s value predictor is realized by an NN function approximator,
the weight parameter update rule simply uses the mismatch between the left-hand
side of Equation (10.14) and the right-hand side. This mismatch is the TD error
in recursive equations; such recursive equations can express the desired
sequential structural relationships in the given problem. Here we note that the
backing-up property of DP shares basic concepts with TD methods.

10.5 ADAPTIVE HEURISTIC CRITIC 

In this section, we discuss a reinforcement learning strategy based on the so-called 
adaptive heuristic critic (AHC) or actor-critic model. It provides a way of 
attempting to find both optimal actions and expected values. The phrase “learning 
with a critic” was used by Widrow et al. [94] to differentiate it from ordinary 
supervised learning characterized by the phrase “learning with a teacher.” In some 
situations, where there is less information available on desired outputs, only right 
or wrong signals will be provided. 

The AHC model typically includes two principal components: the critic (eval- 
uation/prediction) module and the action (control) module. The critic generates 
an estimate of the value (or evaluation) function from state vectors and external 
reinforcement supplied by the world (or environment) as inputs. That is, the critic 
plays an important role in predicting the evaluation function. The critic is adaptive 
because its predictor component is updated using TD methods. The action module 
attempts to learn optimal control or decision-making skills. 

10.5.1 Neuron-like Critic 

Barto et al. introduced a neuron-like adaptive critic model 4 for a pole-balancing 
control problem [8]. They discussed their model in conjunction with an earlier study 
of the Boxes system that Michie and Chambers proposed [54]. 

The Boxes system was constructed to solve the pole-balancing control problem;
its central concept was “task decomposition” whereby state space was partitioned
into 162 non-overlapping subspaces called “boxes,” and different controllers were
used in each digitized state space (box). No generalization was attempted across
subspaces (see Section 10.9.1). Like Michie’s earlier work with MENACE, discussed
in Section 10.2.1, the Boxes model is result-driven, specifically failure-driven
in this case. That is, the model receives no feedback on performance until the
pole falls down (failure). Through many trials, the Boxes system learns to solve
the nonlinear control problem; every failure is a stepping stone to success.

4 A concise description of their work can be found in ref. [85]. Barto presented a
good overview of the adaptive critic methods in ref. [4].

Inspired by the Boxes method, Barto et al. constructed an adaptive critic
model [8] that consists of ACE (adaptive critic element) and ASE (associative search
element) based on an “associative reward-penalty” algorithm [8, 5, 96]; each
component is expressed in an ADALINE [95] (ADAptive LINear Element), which they
call a “neuron-like connectionist element.” The ASE implements and adjusts the
decision policy (or control rules). The ACE learns to provide current evaluations
of control decisions by virtue of failure signals. Specifically, the ACE predicts the
internal reinforcement signal $\hat{r}(t+1)$ (associated with particular input states):

$$\hat{r}(t+1) = r_{t+1} + \gamma V_{t+1} - V_t,$$

where $V_t$ is the current prediction and $\gamma$ is a discounting factor. This
$\hat{r}(t+1)$ corresponds to the TD error discussed in Section 10.3.3 [see
Equation (10.10)]. The $\hat{r}(t+1)$ is sent to the ASE to determine a control action.

Performance comparison with the Boxes system pointed out an advantage of 
Barto et al.’s critic model, “ASE with internal reinforcement supplied by an ACE [8]”: 

The boxes system is restricted in that its design was based on the a 
priori knowledge that the time until failure was to serve as the evaluation 
criterion and that the learning process would be divided into distinct 
trials that would always end with a failure signal. . . . The ASE, on 
the other hand, is capable of working to achieve rewarding events and 
to avoid punishing events which might occur at any time. It is not 
exclusively failure-driven, and its operation is specified without reference 
to the notion of a trial. 

In their adaptive critic model, Barto et al. used 162-component linearly inde- 
pendent state vectors (or standard-unit-basis vectors) as inputs. In other words, 
this model was equivalent to a big lookup table similar to the lookup-table percep- 
tron shown in Figure 10.5. Thus, this model offered no possibility for generalizing 
among states [8, 5]. For large state space problems, using lookup tables is imprac- 
tical, as is visiting all states. Thus, the agent needs to generalize from a limited 
amount of experience using a compact representation of input state vectors (see 
Section 10.7.3). 

10.5.2 An Adaptive Neural Critic Algorithm 

This subsection provides a general description of an AHC algorithm implemented 
in such parameterized functional forms as NN function approximators (value and 
action function approximators). The AHC model basically consists of two NNs: 
the value NN and the action NN. The value NN approximates evaluation functions, 
mapping states to expected values, whereas the action NN generates a plausible (or 
legal) action, mapping states to actions [48, 49, 50]. 

[Figure 10.7: block diagram of a computational learning agent.]

Figure 10.7. An AHC model that uses neural network function approximators: the
value NN and the action NN.


Figure 10.7 illustrates a block diagram of such an NN-based AHC model. The 
adaptive critic receives external (primary) reinforcement from the world and trans- 
forms it into internal (heuristic) reinforcement. 

The weight update rule follows the usual error minimization scheme used in
supervised learning by defining the following squared error $E_{TD}$:

$$E_{TD} = \tfrac{1}{2}\, \text{error}^2$$

with

error = (value incurred by a selected action)
  + $\gamma$ (expected value of observed successive state produced by value NN)
  − (expected value of current state produced by value NN),

where $\gamma$ is a discount rate [see Equation (10.4)]. Both the action NN and the
value NN are trained simultaneously using $E_{TD}$. Algorithm 10.1 explains an
implementation of the AHC concept.

Algorithm 10.1 Adaptive heuristic critic using neural network function approximators

1. Observe the current state: $s \leftarrow$ current state $s_n$.

2. Use the value NN to compute $V(s)$: $e \leftarrow V(s)$.

3. Select an action $a_n$ by using the output of the action NN.

4. Execute the action $a_n$.

5. Observe the successive new state $t_n$ and reinforcement $r_n$.

6. Use the value NN to compute $V(t_n)$.

7. $E \leftarrow r_n + \gamma V(t_n)$.

8. Adjust the value NN by backpropagating the error ($= E - e$).

9. Adjust the action NN according to the error.

□
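
The nine steps of Algorithm 10.1 can be traced in a deliberately simplified sketch. Everything specific in it is an assumption of ours rather than part of the algorithm: the world is a hypothetical five-cell corridor that pays reward 1 at its right end; lookup tables stand in for the value NN and the action NN, so plain table updates replace backpropagation; actions are drawn from a softmax over the stored action preferences; and the step sizes and names are illustrative.

```python
import math
import random

N_STATES = 5          # hypothetical corridor: states 0..4, both ends terminal
LEFT, RIGHT = 0, 1


def step(s, a):
    """Move one cell; reaching state 4 pays reward 1, anything else pays 0."""
    t = s + 1 if a == RIGHT else s - 1
    return t, (1.0 if t == N_STATES - 1 else 0.0)


def softmax_choice(prefs, rng):
    weights = [math.exp(p) for p in prefs]
    return rng.choices([LEFT, RIGHT], weights=weights)[0]


def train_ahc(episodes=2000, alpha=0.1, beta=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    V = [0.0] * N_STATES                           # critic: value table
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]  # actor: action preferences
    for _ in range(episodes):
        s = N_STATES // 2                          # step 1: observe the state
        while 0 < s < N_STATES - 1:
            e = V[s]                               # step 2
            a = softmax_choice(prefs[s], rng)      # step 3
            t, r = step(s, a)                      # steps 4 and 5
            terminal = t in (0, N_STATES - 1)
            E = r + gamma * (0.0 if terminal else V[t])  # steps 6 and 7
            error = E - e
            V[s] += alpha * error                  # step 8 (table update)
            prefs[s][a] += beta * error            # step 9: reinforce or inhibit
            s = t
    return V, prefs


V, prefs = train_ahc()
```

A positive error raises the preference for the action just taken and a negative error lowers it, so under these assumptions the preferences drift toward moving right.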


Recall Equations (10.8) and (10.10); Algorithm 10.1 shows that learning proceeds
concurrently with processing; weight coefficients of the action NN and the
value NN are adjusted to fit learned values and actions by TD methods.

Algorithm 10.1 demonstrates how to use the output of a value NN to control
the output of an action NN by means of backpropagation. To be precise, suppose
we want to maximize the expected value (i.e., reward). Each action should be
chosen to maximize the sum of all future rewards. Thus given an action $a_n$, if
the value of the next state $V(t_n)$ [or $\gamma V(t_n)$] plus the external
reinforcement $r_n$ is greater than the value of the current state $V(s_n)$ (i.e.,
$E - e > 0$), the action looks better than previously expected, and therefore, that
action should be reinforced. Conversely, if the reverse inequality holds, the
action does not seem better, and it should therefore be inhibited. (This is
because we want to maximize the cumulative future rewards.) This procedure
demonstrates our intuitive notion presented in the introduction to this chapter.
[The intuitive notion can also be seen in Equation (10.14) of the incremental DP
discussed in Section 10.4.2.] This is a reward-penalty algorithm. In contrast, in
the reward-inaction algorithm [5, 59, 96], actions are not updated when the
selected action does not look better; this algorithm is based on the concept that
before other actions are put into practice, nothing about their effectiveness can
be learned from a single experience.

In this way, the action NN and the value NN evolve together in the attempt 
to find an optimal policy. In other words, TD methods are employed to solve 
temporal credit assignment problems, and the backpropagation algorithm is used 
for structural credit assignment problems. 

Werbos [36, 88, 89, 90] discussed adaptive critic designs and a relationship
between DP and TD methods by using the terms heuristic dynamic programming (HDP),
dual heuristic programming (DHP), action-dependent heuristic dynamic
programming (AD-HDP), action-dependent dual heuristic programming (AD-DHP), and
global dual heuristic programming (G-DHP). A class of DHPs approximate the
derivatives of action-value (or value) functions.

10.5.3 Exploration and Action Selection

In the supervised learning discussed in Chapter 9, the agent usually acts ac- 
cording to the gradient vector provided by a teacher; the teacher implements a 
reward-penalty scheme to adapt the network’s weights. On the other hand, the 
reinforcement learning agent has no teacher to supply such directed information 
during its learning; it usually receives a scalar reinforcement, which is just infor- 
mation about the current evaluation of behavior. 

The reinforcement learning agent encounters a conflict between how it 
has to change behavior in order to obtain directional information, and 
how the resulting directional information tells it to change its behavior 
for improvement [4]. 

In other words, the following two factors must influence each action selection, and 
they ordinarily conflict [4, 82, 32]:


1. The desire to acquire more knowledge about actions’ consequences 
to make better selections in the future. 

2. The desire to use what is already known about the relative merits 
of the actions. 

The best decision for one is not necessarily best for the other. This dilemma between
exploration and exploitation (or conflict between identification and control) is absent 
in supervised learning. 

Random search behavior by the learner is usually a necessary component of any
reinforcement learning algorithm, and it offers a remedy for the aforementioned
dilemma. Such random exploration in the space of possible solutions is important
because no directional information toward the right answer is available. To
organize the desired exploratory behavior, stochastic action selection
should be considered. The output of the action NN is interpreted as exponents $s_i$
in a Boltzmann distribution that yields the probability of an action $a_i$:

$$\mathrm{Prob}(a_i) = \frac{\exp(s_i/T)}{\sum_k \exp(s_k/T)}, \qquad (10.15)$$

where T is a temperature parameter for an annealing process. Of course, an action 
favored by the action NN has more chance of being selected. When the cooling- 
temperature scheme is employed, exploration in the solution space will gradually 
shrink to favor a deterministic action selection. This sort of stochastic exploration 
provides a way of estimating directional information to change behavior toward 
performance enhancement in exploring the environment. Thus, it alleviates the 
dilemma between exploration and exploitation. 
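A minimal sketch of Boltzmann action selection may make the role of the temperature concrete. The function names and the list-of-scores interface are assumptions made for illustration; subtracting the maximum score before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math
import random


def boltzmann_probs(scores, temperature):
    """Probabilities proportional to exp(score / T) for each action."""
    m = max(scores)  # shift for numerical stability; the ratios are unaffected
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(scores, temperature, rng=random):
    """Draw an action index at random according to the Boltzmann probabilities."""
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


scores = [1.0, 2.0]  # hypothetical action NN outputs for two actions
for T in (100.0, 1.0, 0.01):
    print(T, boltzmann_probs(scores, T))
```

At a high temperature the two actions are chosen almost uniformly (exploration); as the temperature is lowered, the favored action is chosen almost deterministically (exploitation).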

In action NN structural terms, there are several possibilities. For two-action
problems such as the jackpot problem in Section 10.2.1 and the cost path problem
in Section 10.7.2, there might be three possibilities:

1. Use one action NN with a single output unit.

2. Use one action NN with two output units, one unit for each action.

3. Use two action NNs with a single output unit; one NN for each action.

Figure 10.8. Three possible action NN architectures of the AHC-net for two-action
problems such as a cost path problem in Section 10.7.2; “Up” means “go diagonally
upward” and “down” denotes “go diagonally downward.”

These architectures are illustrated in Figure 10.8. For instance, a single-output 
action NN in Figure 10.8(a) can be applied to a deterministic two-action cost path 
problem in Section 10.7.2. The action NN produces an output ranging from 0.0 (ac- 
tion u) to 1.0 (action d) [93]. 

Furthermore, stochastic units can be chosen as the output units for action 
NNs [8, 31, 32, 33]. The stochastic units are usually regarded as the ordinary 
deterministic threshold units (such as a sigmoidal logistic function) with a random 
or noisy threshold in a form of $f(x + \text{noise})$.

10.6 Q-LEARNING 

Q-learning is a form of model-free reinforcement learning [87]. In this section, we
investigate “one-step Q-learning,” a simple class of Watkins’s Q-learning [86, 87].
Throughout the chapter, we use the term Q-learning.

10.6.1 Basic Concept 

Q-learning is a simple way of solving Markovian action problems with incomplete
information, based on the action-value function Q that maps state-action pairs to
expected returns. The idea of assigning values to state-action pairs can be seen
in DP; Watkins [86] called Q-learning the “incremental version of DP.” Q-learning
successively improves its evaluations of particular actions at particular states, just
as incremental DP, discussed in Section 10.4.2, improves its evaluations of particular
states.

Learning proceeds as with TD methods; an agent tries an action at a particular 
state and evaluates its consequence in terms of the immediate reward or penalty it 
receives from the world and its estimate of the value of the state resulting from the 
taken action. The aim of the agent is not merely to maximize its immediate reward 
in the current state, but to maximize the cumulative reward it receives over some 
period of future time. 

The AHC reinforcement learning architecture requires two fundamental memory
buffers: one for the evaluation function and one for the policy. On the other hand,
Q-learning maintains only one: for each pair of state $s$ and action $a$, an estimated
Q-value of taking $a$ in $s$. In exchange, Q-learning requires additional complexity in
determining the policy from the Q-values, as we show in this discussion.

The objective in Q-learning is to estimate the values of an optimal policy. The 
value of a state can be defined as the value of the state’s best state-action pair: 

$$V(s) = \max_{a} Q(s, a). \qquad (10.16)$$

The optimal policy is determined according to the policy function $\pi$:

$$\pi(s) = a \ \text{ such that } \ V(s) = Q(s, a) = \max_{b \in \text{actions}} Q(s, b). \qquad (10.17)$$

In one-step Q-learning, only the action-value function Q of the most recent
state-action pair is updated after a one-step delay; the update rule for Q-values at the
$n$th stage, where $s_n$ is the current state and $a_n$ is the selected action, is based on
TD methods:

$$Q_n(s, a) = \begin{cases} Q_{n-1}(s, a) + \eta_n \left[ r_n + \gamma V_{n-1}(t_n) - Q_{n-1}(s, a) \right] & \text{if } s = s_n \text{ and } a = a_n, \\ Q_{n-1}(s, a) & \text{otherwise,} \end{cases} \qquad (10.18)$$

where

$$V_{n-1}(t) = \max_{b \in \text{actions}} Q_{n-1}(t, b).$$

In the early stages of learning, the Q-values may not accurately reflect the policy 
they implicitly define. By trying all actions in all states repeatedly, the agent learns 
which are best overall, judged by the long-term discounted reward. In other words, 
the “current Q(s, a)” is the “expected Q-value of taking action a in state s, and 
then using optimal actions in all future states.” 
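For a look-up table, the one-step update of Equation (10.18) can be sketched as follows. The dictionary-based table, the default values of `eta` and `gamma`, and the state and action labels are assumptions chosen for illustration; only the update rule itself is taken from the text.

```python
from collections import defaultdict


def q_update(Q, s, a, r, t, actions, eta=0.5, gamma=0.9):
    """Apply the one-step rule to the visited pair (s, a); other entries are untouched."""
    v_next = max(Q[(t, b)] for b in actions)           # V(t) = max_b Q(t, b)
    Q[(s, a)] += eta * (r + gamma * v_next - Q[(s, a)])
    return Q[(s, a)]


def greedy_action(Q, s, actions):
    """Policy of Equation (10.17): an action whose Q-value equals V(s)."""
    return max(actions, key=lambda b: Q[(s, b)])


Q = defaultdict(float)        # all Q-values start at zero
actions = ["u", "d"]
print(q_update(Q, "s", "u", r=1.0, t="t", actions=actions))   # 0.5
Q[("t", "d")] = 2.0           # pretend the next state has already been evaluated
print(q_update(Q, "s", "u", r=1.0, t="t", actions=actions))
print(greedy_action(Q, "t", actions))                         # d
```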

10.6.2 Implementation 

Now we consider implementing a Q-learning NN (Q-net). Algorithm 10.2 presents 
a training procedure for such a Q-net, as illustrated in Figure 10.9. 

Figure 10.9. A Q-net architecture (inputs: state and action; output: Q-value).

Algorithm 10.2 One-step Q-learning using neural network function approximators


1. Observe the current state $s_n$.

2. Select an action $a_n$ by a stochastic procedure.

3. For the selected action $a_n$, use the Q-net to compute $U_a$: $U_a \leftarrow Q_{n-1}(s_n, a_n)$.

4. Execute the action $a_n$.

5. Observe the resulting new state $t_n$ and reinforcement $r_n$: $t \leftarrow t_n$.

6. Use the Q-net to compute $Q_{n-1}(t, b)$, $b \in$ actions.

7. $u \leftarrow r_n + \gamma \max_{b \in \text{actions}} Q_{n-1}(t, b)$.

8. Adjust the Q-net by backpropagating the one-step error:

$$\text{error} = \begin{cases} u - U_a & \text{if } s = s_n \text{ and } a = a_n, \\ 0 & \text{otherwise.} \end{cases}$$


□ 
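The numerical steps of the procedure (3 and 6 to 8) can be sketched for a single observed transition. Everything about the network here is an assumption made for illustration: a tiny one-hidden-layer net with tanh hidden units, a linear output, no bias units, and plain gradient descent without momentum. The `terminal` flag, which replaces the net's estimate by zero at the end of an episode, is also an assumption. For a cost-minimization task the `max` would become a `min`.

```python
import math
import random


class TinyQNet:
    """Illustrative Q-net: input (state, action) -> scalar Q-value."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        return sum(w * hj for w, hj in zip(self.w2, h)), h

    def q(self, x):
        return self.forward(x)[0]

    def backprop(self, x, error, lr):
        """One gradient step that moves the output for x in the direction of error."""
        _, h = self.forward(x)
        for j, hj in enumerate(h):
            grad_h = error * self.w2[j] * (1.0 - hj * hj)  # uses w2 before its update
            self.w2[j] += lr * error * hj
            for i, xi in enumerate(x):
                self.w1[j][i] += lr * grad_h * xi


def q_step(net, encode, s, a, r, t, actions, gamma=1.0, lr=0.05, terminal=False):
    """Process one transition (s, a, r, t) and return the one-step error."""
    u_a = net.q(encode(s, a))                                        # step 3
    best = 0.0 if terminal else max(net.q(encode(t, b)) for b in actions)  # step 6
    u = r + gamma * best                                             # step 7
    net.backprop(encode(s, a), u - u_a, lr)                          # step 8
    return u - u_a


encode = lambda s, a: [s[0], s[1], a]      # hypothetical state-action encoding
net = TinyQNet(n_in=3, n_hidden=5, seed=1)
print(q_step(net, encode, (0.1, 0.4), -1.0, 1.0, (0.2, 0.3), [-1.0, 1.0]))
```

Only the output for the visited state-action pair is trained, which corresponds to the zero error assigned to all other pairs in step 8.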


A selected action usually matches $\pi(s)$ as defined in Equation (10.17), but
occasionally differs from it; for instance, the policy is stochastically implemented
according to a Boltzmann distribution as in Equation (10.15):

$$\mathrm{Prob}(a_i) = \frac{\exp[Q(s, a_i)/T]}{\sum_k \exp[Q(s, a_k)/T]}, \qquad (10.19)$$

where $T$ is the temperature parameter for an annealing process.

Dayan [23] showed that TD(0) corresponds to a special case of Q-learning when
there is just one admissible action for each state. Lin compared AHC-nets and
Q-nets [48, 50]. Chapman and Kaelbling discussed the
“generalization” problem in the Q-learning framework [21]; they described the G
algorithm, which is based on recursive splitting of the state space according to
statistical measures of differences in reinforcements received. It incrementally builds
up a tree-structured table of Q-values.
Figure 10.10. The configuration of a cost path problem in a coordinate system.


Table 10.2. Five experimental representations of input states.

Vertex   Linearly independent     3-D vectors   2-D coordinates   2-D coordinates   1-D scalars
         6-D unit-basis vectors                 (a)               (b)
A        (1, 0, 0, 0, 0, 0)       (0, 0, 1)     (1, 4)            (0.01, 0.04)      1
B        (0, 1, 0, 0, 0, 0)       (0, 1, 0)     (2, 3)            (0.02, 0.03)      2
C        (0, 0, 1, 0, 0, 0)       (0, 1, 1)     (2, 5)            (0.02, 0.05)      3
D        (0, 0, 0, 1, 0, 0)       (1, 0, 0)     (3, 2)            (0.03, 0.02)      4
E        (0, 0, 0, 0, 1, 0)       (1, 0, 1)     (3, 4)            (0.03, 0.04)      5
F        (0, 0, 0, 0, 0, 1)       (1, 1, 0)     (3, 6)            (0.03, 0.06)      6


This section concludes with a note on a convergence theorem: Q-learning
based on look-up tables has been proven to converge to optimal values and decisions,
whereas the AHC-learning illustrated in Figure 10.7 has not been. Watkins and
Dayan [87] specifically showed that Q-learning converges to the optimum
action-values with probability 1 as long as all actions are repeatedly sampled in all states,
the action-values are represented discretely, and the learning rate goes to zero
appropriately.


10.7 A COST PATH PROBLEM 

This section considers a simple cost path problem configured in the triangular
three-stage path network illustrated in Figure 10.10. It is a discrete-action environment,
and each path incurs a cost. We have already discussed the TD formulation of value
functions for predicting cumulative outcomes in Section 10.3.3. We first construct
NNs based on TD methods (TD-nets) to predict expected costs. We then consider
actions required to find an optimal minimum-cost path, and investigate both
AHC-nets that execute an adaptive heuristic critic concept, and Q-nets that implement
Q-learning. Of course, the state-transition probabilities and the incurred-cost data
are unknown to the NNs (TD-net, AHC-net, and Q-net).

Table 10.3. Five experimental TD-net structures. The fifth TD-net has bias units,
whereas the other four do not. Parameter number means the number of adjustable
weights in a network. For input state representations, see Table 10.2.

TD-net   Structure   Bias units   Parameter   Learning rate α   Input state
                                  number                        representations
1        6 x 1       no           6           0.0007            6-D unit-basis
2        3 x 5 x 1   no           20          0.00009           3-D vectors
3        2 x 4 x 1   no           12          0.00007           2-D coordinates (a)
4        2 x 4 x 1   no           12          0.007             2-D coordinates (b)
5        1 x 3 x 1   yes          10          0.0000009         1-D scalars

We examine a lookup-table perceptron with linearly independent 6-D input state 
vectors, as discussed in Section 10.3.2. For comparison purposes, we also examine 
NNs with hidden layers, employing input state representations different from those 
of the linearly independent 6-D vectors in an attempt to draw generalization capabil- 
ities. Five experimental representations of input states are presented in Table 10.2. 

10.7.1 Expected Cost Path Problem by TD methods 

In this subsection, we assume that the probability of action d and that of action u are 
the same, 0.5. The agent’s objective is to predict the expected cost of the particular 
“0.5-0.5 policy” at each vertex using TD-nets. The TD-nets were trained using 
Equation (10.8) defined in Section 10.3.3. Table 10.3 presents five experimental 
architectures of the TD-nets in accordance with the five representations of input 
states in Table 10.2. 

For the purposes of testing NN approximation capabilities, this trivial cost path
problem has the advantage of enabling us to compute the correct answer readily by

$$V(s) = 0.5\{\text{cost}(s, d) + V(t_d)\} + 0.5\{\text{cost}(s, u) + V(t_u)\},$$

where $s$, $d$, $u$, $t_d$, $t_u$, and cost(·) are as defined in Section 10.4.1. (This is not a DP
problem because no action choice is to be made.) The expected costs starting at
each vertex can be computed by the preceding equation to be

$$V(F) = 2, \quad V(E) = 4, \quad V(D) = 5, \quad V(C) = 6, \quad V(B) = 6.5, \quad V(A) = 9.25.$$

These are the desired predictions.

Table 10.4. Expected costs for the six nonterminal vertices using five different
TD-nets. They were trained by TD(0.3), in which the recency parameter $\lambda$ is
0.3. RMSE signifies “root mean squared error.”

TD-net   Vertex A   Vertex B   Vertex C   Vertex D   Vertex E   Vertex F   Required    RMSE
                                                                           epoch
1        …          …          …          …          …          2.018      29,200      …
2        9.301      …          5.827      …          …          …          …           …
3        8.939      4.881      7.451      …          4.515      …          …           1.714
4        9.001      6.434      …          …          …          …          229,000     0.607
5        9.343      7.987      6.386      4.806      3.535      2.681      9,850,000   0.717
Target   9.25       6.5        6.0        5.0        4.0        2.0        -           0

Table 10.4 shows the results obtained by the five TD-nets. It was observed that
when the recency parameter $\lambda$ was about 0.3, the performance was better than the
two extreme cases of TD(0) and TD(1); this finding coincides with results from
other problems presented by Sutton [77].
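The RMSE column can be reproduced from the per-vertex predictions. As a small check, the sketch below recomputes the RMSE of the fifth TD-net from its row in Table 10.4 and the target values; the variable and function names are illustrative, and the assumption is that the error is averaged over the six nonterminal vertices.

```python
import math

# Target expected costs and the fifth TD-net's predictions (Table 10.4).
targets = {"A": 9.25, "B": 6.5, "C": 6.0, "D": 5.0, "E": 4.0, "F": 2.0}
td_net_5 = {"A": 9.343, "B": 7.987, "C": 6.386, "D": 4.806, "E": 3.535, "F": 2.681}


def rmse(pred, target):
    """Root mean squared error over the vertices listed in target."""
    squared = [(pred[v] - target[v]) ** 2 for v in target]
    return math.sqrt(sum(squared) / len(squared))


print(round(rmse(td_net_5, targets), 3))  # 0.717, matching the tabulated value
```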

10.7.2 Finding an Optimal Path in a Deterministic Minimum Cost 
Path Problem 

In the following subsections, we consider actions, using AHC-nets in Figure 10.7
and Q-nets in Figure 10.9. The agent’s goal is to learn the optimal sequence of
three actions that minimizes the total cost (i.e., the sum of the costs associated
with each of the three steps). A glance at Figure 10.10 shows that the optimal path
can readily be found:

Optimal path: vertex A => vertex B => vertex E => vertex H.

We assume that our computational learning agent always starts at vertex A. 
The agent chooses either action d or action u at each vertex. The environment 
accordingly informs the agent of the cost associated with the action taken. This 
process is repeated three times. For the third action, the terminal cost provided by 
the world (environment) is used as the value rather than the NN’s output. This is 
the realization of the boundary conditions. 


AHC-net Simulation 
Table 10.5. Four experimental AHC-nets trained by TD(0). Learning rates denoted
by α were chosen by a trial and error process. All action NNs have bias units.

AHC-net              Value NN    Action NN           State representations
structure                        (with bias units)   (in Table 10.2)
W           size     6 x 1       6 x 4 x 1           6-D unit basis
            α        …           0.15
X           size     3 x 2 x 1   3 x 6 x 1           3-D
            α        0.00009     0.009
Y           size     2 x 4 x 1   2 x 4 x 1           2-D coordinates (a)
            α        0.0007      0.00001
Z           size     2 x 4 x 1   2 x 4 x 1           2-D coordinates (b)
            α        0.007       0.0007


Table 10.6. Results obtained by four AHC-nets whose architectures W, X, Y, and
Z are presented in Table 10.5. P_down denotes “probability of action d.”

AHC-net   Expected   Vertex A   Vertex B   Vertex C   Vertex D   Vertex E   Vertex F   Required
          values                                                                       epoch
W         P_down     0.992      0.034      (0.023)    (0.009)    0.999      (0.998)    300,000
          Costs      4.079      3.019      (2.122)    (1.047)    2.024      (0.014)
X         P_down     0.994      0.0104     (0.006)    (0.050)    0.998      (0.952)    500,000
          Costs      4.030      3.016      (5.716)    (1.013)    2.010      (1.794)
Y         P_down     0.927      0.351      (0.964)    (0.024)    0.759      (0.974)    15,124,500
          Costs      4.153      2.956      (2.614)    (0.710)    1.631      (2.684)
Z         P_down     0.956      0.070      (0.999)    (0.0003)   0.945      (0.999)    8,350,000
          Costs      4.778      3.302      (5.513)    (1.335)    2.117      (3.918)
Target    P_down     1.0        0.0        (0.0)      (0.0)      1.0        (1.0)
values    Costs      4.0        3.0        (2.0)      (1.0)      2.0        (0.0)


We implemented AHC-nets consisting of two feedforward NNs (or multilayer
perceptrons): the value NN and the action NN, as illustrated in Figure 10.7. Table 10.5
presents four experimental architectures of the AHC-nets (W, X, Y, and Z) as well
as their representations of input states. They were trained by TD(0). In addition,
all action NNs had the architecture (a) in Figure 10.8 with bias units, and they
were trained in conjunction with the well-known backpropagation learning rule with
a momentum term (0.8). Notice that in structure W of the AHC-net, the value
NN (6 x 1) is equivalent to the first TD-net in Table 10.3 and the lookup-table
perceptron we tested in Section 10.3.2.

Table 10.7. Two Q-nets’ architectures and their resulting convergence
performance. These Q-nets have no bias units. Their learning rates were optimized by a
process of trial and error.

Q-net structure (size)            Input state             Parameter   Learning   Required
                                  representations         number      rate       epoch
Look-up table method              6-D unit basis          12          0.3        400
(two “6 x 1” perceptrons)
A Q-net with one hidden           2-D coordinates (a)     20          0.03       209,500
layer (3 x 5 x 1)                 with one “action”

Algorithm 10.1 in Section 10.5.2 explains implementation of the AHC concept
in an NN framework: if the value of the next state $V(t_n)$ plus the external cost
$r_n$ was smaller than the value of the current state $V(s_n)$ (i.e., if the TD error was
negative), the action looked better because we wanted to minimize the cumulative
costs. Conversely, if the TD error was positive, the action was inhibited. This
was intended as a demonstration of the intuitive notion discussed in Section 10.5.2.
Through repeated passes, the AHC-nets learned the optimal actions as well as the
expected minimum costs.

Table 10.6 shows the results obtained by the four AHC-nets: W, X, Y, and
Z. As seen in the results for Y, the AHC-net Y did not strongly inform us of
the optimal action choices at vertices B and E. Of course, a policy (i.e., a complete
mapping from states to actions) is not very important for this deterministic problem,
but two AHC-nets, Y and Z, failed to yield the optimal decision at vertex C, whereas
AHC-net X optimized the action choice. Additionally, the results for X show
that the expected costs of the two vertices C and F were not close to the desired
ones although the action choices were optimized. This is because little
learning could take place at those vertices that were rarely visited as decisions
became (we hope) more optimal.


Q-learning Simulation 

In this subsection, we apply Q-learning to solving the minimum cost path problem.
To compare with the look-up table method, we tested the Q-net that has one hidden
layer with state-action pairs as inputs illustrated in Figure 10.9. More specifically,
the Q-net had three input units: two for the 2-D coordinate state representations (a)
in Table 10.2, and one for selected actions; that is, we explicitly represented input
vectors by using an extra input unit for an action whereby action d corresponded
to -1 and action u to +1. For instance, a state of vertex A with action d was
expressed in a vector, $(1, 4, -1)^T$. The experimental Q-nets have no bias units.
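The input encoding just described can be written down directly. The coordinates are the 2-D coordinates (a) of Table 10.2 and the action codes are those given above; the dictionary and function names are illustrative assumptions.

```python
# 2-D coordinates (a) of the six non-terminal vertices (Table 10.2).
COORDS_A = {"A": (1, 4), "B": (2, 3), "C": (2, 5),
            "D": (3, 2), "E": (3, 4), "F": (3, 6)}

# Action d ("go diagonally downward") is coded as -1, action u as +1.
ACTION_CODE = {"d": -1, "u": +1}


def encode(vertex, action):
    """Three-component Q-net input: two state coordinates plus one action unit."""
    x, y = COORDS_A[vertex]
    return (x, y, ACTION_CODE[action])


print(encode("A", "d"))  # (1, 4, -1)
```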
Table 10.8. The desired Q-values at the six non-terminal vertices. Actions u and
d denote “go diagonally upward” and “go diagonally downward,” respectively.

Action   Vertex A   Vertex B   Vertex C   Vertex D   Vertex E   Vertex F


Figure 10.11. The Q-learning curves for the six non-terminal vertices obtained
using a Q-net with one hidden layer (3 x 5 x 1). The horizontal axis represents
the epoch number and the vertical axis shows the Q-value.

Notice that the look-up table method is equivalent to a Q-net composed of two look- 
up table perceptrons (with linearly independent 6-D input state vectors) — that is, 
one perceptron for action d and the other for action u. 

Algorithm 10.2 in Section 10.6 explains implementation of the Q-net. The path 
leading to a state with the minimal Q-value had to be chosen for the next action 
because we wanted to minimize the expected costs. 

Table 10.7 presents the convergence performance; the Q-net with one hidden
layer required considerably more iterations than the look-up table Q-net to converge
to the desired Q-values in Table 10.8. Their learning rates were optimized by
a process of trial and error. Figure 10.11 shows the Q-learning curves of the six
non-terminal vertices obtained from the Q-net with one hidden layer (3 x 5 x 1).


10.7.3 State Representations for Generalization 

We described a trivial minimum cost path problem, implementing TD-learning, an 
AHC concept, and Q-learning on the basis of NN function approximators. Ta- 
bles 10.4, 10.6, and 10.7 show that the networks converged quickly and provided 
outputs close to the desired values when they were set up as lookup-table
perceptrons with linearly independent 6-D input state vectors. As we noted when
discussing the weakness of the Boxes model in Section 10.5.1, such a tabular
representation of the environment is questionable when it is applied to more
realistic environments that
entail many possible states. Thus, we have explored more compact representations
in pursuit of generalization over input states. However, even for such a trivial prob- 
lem, when those networks had hidden layers and different input state representations 
were introduced, they clearly took many more iterations to converge. 

Notably, 2-D coordinate state representations (a) in Table 10.2 caused the rela- 
tively poor performance of the third TD-net in Table 10.3 and the AHC-net Y in 
Table 10.5, even though we varied their parameter setups repeatedly. In many of 
our trial setups, it was observed that the AHC-net Y had converged to poor action 
choices at vertex B, and had attained no precision in matching the expected out- 
puts. That is, the AHC-net Y frequently failed to inform us of the optimal path 
(i.e., vertex A => vertex B => vertex E => vertex H). The expected values themselves
were unrelated to such coordinate state representations. We actually explored other 
ad hoc 2-D representations of input states. One of the ad hoc 2-D representations 
was (b) in Table 10.2, wherein coordinates were simply divided by 100. Similarly, 
coordinates were normalized, or they were mapped onto unit circle inputs, and so 
on. Yet almost no significant difference in performance was obtained. The difficul- 
ties encountered in this small three-stage problem suggest that state representations 
determine the architecture of the networks, and therefore they are crucial to overall 
performance; the converged values depend strongly on input vector representations. 
We thus need to choose input state representations carefully as well as parameter 
setups (e.g., learning rates and number of neurons). 

Other issues on state representations can be found in the literature. An au- 
tonomous mobile robot should be able to deal with incorrect state descriptions 
generated by poor sensors [3, 51]. Whitehead and Ballard discussed a similar prob- 
lem in world state representations [92]; the mapping from world states to the agent’s 
state representations can be many-to-many in a complex system environment. A 
single state representation may represent multiple world states. Whitehead and Bal- 
lard called this overlapping between the world and the agent’s state representation 
“perceptual aliasing.” 
10.8 WORLD MODELING*

The agent is assumed to be able to observe states, actions, and reinforcement signals. 
It can therefore model the mapping from actions to reinforcement signals. More 
precisely, the world model is intended to model the input-output behavior of the 
dynamic world (or environment); given a state and an action, it is supposed to 
predict the received resultant reinforcement and next state [48, 74]. 

10.8.1 Model-free and Model-based Learning* 

From the standpoint of world modeling, reinforcement learning is roughly classified 
into the following two types: 

1. Learning optimal actions and values by sampling the world without attempt- 
ing to learn a world model 

2. Learning a world model by sampling the world, and then basing optimal 
actions and values on the learned world model. 

Method 1 is model-free or direct learning. Method 2 is model-based or in- 
direct learning. In conformity with control engineering terminology, this indi- 
rect method corresponds to a system identification procedure to form a world 
model [7, 25]. Method 2 describes a sequential training strategy whereby the world 
model is trained first and then frozen. Alternatively, it can be trained simultane- 
ously with action and value NNs [33, 69]. 

Classical DP, discussed in Section 10.4.1, is categorized as model-based learning 
because it uses world models, such as a transition model and a cost (or reward) 
model. In contrast, model-free learning of optimal actions and values can be viewed 
as modern DP (like incremental DP, discussed in Section 10.4.2), because the scope
of DP is basically characterized by seeking the optimal value function without con- 
cern for what the final solution value is and by using a backup property of relating 
a value at a point to values at the next points. This is a modern interpretation of 
DP in the spirit of model-free learning. 

In light of our cost path problem, two agents based on methods 1 and 2 can be 
described more specifically as follows: 

• The model-free learning agent attempts to construct policy and evaluation 
functions directly with no ability to predict transitions or immediate costs 
resulting from its performed actions, and with no memory to remember more 
than one past observation explicitly. 

• The model-based learning agent either knows all relevant probabilistic infor- 
mation about transitions and costs or estimates this information through ob- 
servation to determine decisions and values by a suitable version of classical 
dynamic programming or a similar optimization technique. 
Figure 10.12. Distal supervised learning; an available target value is the distal
outcome but not the proximal action.


10.8.2 Distal Teacher* 

In the framework of supervised learning, Werbos discussed a model-based approach
that approximated the dynamics of the real world and showed that it augmented the
capabilities of supervised learning [89]. Similarly, Jordan and Rumelhart
demonstrated how distal supervised learning algorithms can be applied to an unknown
dynamic environment that intervenes between actions and desired outcomes [41].
From the agent’s viewpoint, the outcomes can be viewed as “distal” desired values
because the agent converts reinforcement signals (called “intentions”) into actions,
and then the environment transforms the actions into final outcomes; hence this is
called “distal supervised learning,” as illustrated in Figure 10.12. The learning agent forms
a predictive internal model, called a “forward model,” by exploring the outcomes 
associated with particular choices of actions. The forward model outputs a pre- 
dicted reinforcement signal based on the state and the action; that is, it predicts 
the consequence of a given action in the context of a given state vector [40, 41]. In 
biological contrast, Albus [3] mentioned that “any creature with a certain sort of 
memory can hypothesize an action and receive a mental image of the results of that 
action before it is performed.” 

10.8.3 Learning Speed* 

In general, reinforcement learning based solely on TD methods is a slow process [48]
(the direct/model-free case). Learning a world model and practicing with the model
can be useful in speeding up the learning process (the indirect/model-based case).

Direct methods can be used as components of model-based methods [9]. Sutton’s
Dyna architecture [73] implemented one way of blending direct and indirect
methods that retains many of the advantages of each approach [7]; Dyna employed the
incremental computational DP steps discussed in Section 10.4.2, sometimes taking
the actions with the real world and sometimes taking them with the world model
based on “relaxation planning” [73, 75]. It has been reported [73, 74] that the use
of an internal model can dramatically speed up trial-and-error learning processes,
and that planning a sequence of actions to achieve some goal can be done even with
incomplete, changing, and often incorrect world models.
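The idea of interleaving real experience with practice on a learned model can be sketched with a tabular learner. This is a sketch in the spirit of Dyna, not a reproduction of Sutton's architecture: the epsilon-greedy exploration, the deterministic one-entry-per-pair model, the `step_fn` interface returning `(reward, next_state, done)`, and all parameter values are assumptions made for illustration.

```python
import random
from collections import defaultdict


def dyna_q(step_fn, start, actions, episodes=50, planning_steps=10,
           eta=0.5, gamma=0.9, epsilon=0.1, max_steps=200, seed=0):
    """Tabular Q-learning plus extra backups replayed from a learned world model."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (state, action) -> (reward, next_state, done), from real experience

    def backup(s, a, r, t, done):
        v_next = 0.0 if done else max(Q[(t, b)] for b in actions)
        Q[(s, a)] += eta * (r + gamma * v_next - Q[(s, a)])

    for _episode in range(episodes):
        s = start
        for _step in range(max_steps):
            if rng.random() < epsilon:
                a = rng.choice(actions)                       # explore
            else:
                a = max(actions, key=lambda b: Q[(s, b)])     # exploit
            r, t, done = step_fn(s, a)       # act in the real world
            backup(s, a, r, t, done)         # direct (model-free) learning
            model[(s, a)] = (r, t, done)     # update the world model
            for _plan in range(planning_steps):   # practice with the model
                ps, pa = rng.choice(list(model))
                backup(ps, pa, *model[(ps, pa)])
            if done:
                break
            s = t
    return Q


def chain(s, a):
    """Hypothetical four-state chain: reaching state 3 ends the episode with reward 1."""
    t = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    done = (t == 3)
    return (1.0 if done else 0.0), t, done


Q = dyna_q(chain, start=0, actions=["right", "left"])
print(Q[(0, "right")], Q[(0, "left")])
```

Each real step is followed by `planning_steps` simulated backups, so values propagate back from the goal with far fewer real interactions than direct learning alone would need.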

Millan and Torras applied a model-based approach to a robot path-finding
problem [25]. Mitchell and Thrun [57] tested neural networks with explanation-based
learning [24, 56] for robot control in a discrete action environment, and
suggested an advantage over Sutton’s Dyna and Jordan and Rumelhart’s distal teacher
method. Lin compared the performance of model-based methods
and model-free methods, and implemented modified “relaxation planning”
algorithms inspired by Dyna [48, 50]. Lin also noted that a sufficiently good world
model is not easy to obtain [48].

Figure 10.13. A modular Q-network architecture that can be identical to the
CQ-learning model proposed by Singh [71, 72].

10.9 OTHER NETWORK CONFIGURATIONS* 

In Section 10.7, reinforcement learning networks were expressed in the simple form of 
feedforward (multilayer) perceptrons. In this section, we describe two other impor- 
tant network configurations for realizing reinforcement learning: modular networks 
and recurrent networks. 

10.9.1 Divide-and-Conquer Methodology* 

With respect to continuous state space, the usual practice is to decompose an
entire given state space into subsets and apply different evaluation functions under
different conditions. The Boxes model in Section 10.5.1 was an example of this
divide-and-conquer technique. When such decomposition is used, a form of
generalization can be achieved by employing an averaging process over neighboring
subsets [8]. Parametric approximators and distributed representations such as
neural networks can provide generalization abilities to evaluate states never visited in
past experience, as discussed in Section 10.7.3. In this perspective, modular and
hierarchical network architectures may be applied to developing more sophisticated
reinforcement learning systems, because such network architectures are closely
related to the task decomposition concept, as described in Section 9.6. In other words,
splitting up a given task may reduce the agent’s average learning load to the point
of being able to generalize to produce similar actions in similar states.

Figure 10.14. Neural network architectures with context units: an Elman network
(left), and a Jordan network (right). Darker-shaded arrows have fixed weights of
1.0.

Mahadevan and Connell [51] employed a switching mechanism in a subsumption 
architecture so that a Q-learning robot could perform new behaviors in a previously 
unknown environment. The whole reinforcement learning system had a hierarchical
behavior-based structure. Likewise, Singh [71, 72] constructed a CQ-learning
(compositional Q-learning) architecture based on the modular network of Jacobs et
al. [37, 38]. Singh’s model learns multiple compositionally structured sequential
tasks, whereas Jacobs’s model learns multiple non-sequential tasks in accordance
with ordinary supervised learning. Figure 10.13 illustrates a schematic diagram
that can be identical to Singh’s proposed CQ-learning architecture. Several Q-nets 
(local experts) were mediated by a stochastic switch connected to a gating network. 

Radial basis function networks (RBFNs) [45] and fuzzy systems [13, 12, 47] are 
also discussed in a similar way because of their local generalization abilities. (See 
also Section 18.2 in Chapter 18.) 

10.9.2 Recurrent Networks* 

This subsection hints at a conceptually important tie between reinforcement learn- 
ing and recurrent networks. The recurrence makes it possible for the networks to 
process sequential inputs. Hence, the recurrent networks may discover an intrinsic 
temporally successive structure of a given task. 

In any environment, the outcome at any given time may be affected arbitrarily 
by prior actions. Both reinforcements and states may depend arbitrarily on the 
past history of the agent’s outputs. When we use NN function approximators, their 
outputs must depend on both the current state and on some internal state that,
in turn, depends on the historical record. This requirement may be realized by
NNs with context units embedded in their own architectures, such as an Elman 
network [28] and a Jordan network [39], illustrated in Figure 10.14. The context 
units implicitly encode the history of the entire past, as far back as it goes. Thus, 
this type of NN may be able to deal with the sensitivity of value and action to 
whatever in the prior history is potentially relevant to environmental dynamics and 
reinforcement. 

In Section 10.4.1, we emphasized the importance of an explicit state description
adequate to allow a DP solution based on the principle of optimality. By comparison,
Elman-type networks may be able to define a state description implicitly and
automatically, by including the appropriate amount of past history so that a real
process under consideration is Markovian.

Schmidhuber discussed recurrent network models for reinforcement learning [69, 
70]. Jordan and Rumelhart showed recurrent networks with forward models [41]. 


10.10 REINFORCEMENT LEARNING BY EVOLUTIONARY COMPUTATION*

10.10.1 Bucket Brigade* 

Classifier systems are message-passing production systems that learn temporal se- 
quences of string rules (called classifiers) through credit assignment based on the 
bucket brigade algorithm and on GA-based rule discovery. The bucket brigade is 
an algorithm that adjusts the strengths of the classifiers and determines which action
should be taken. The update equation for the classifiers’ strengths is similar, though
not identical, to the Q-value update rule in Equation (10.18) and the TD formula.
Rule discovery is done by the GA, and thus it is stochastic in nature. Moreover, the
GA forms plausible new classifiers through genetic operations. Detailed mechanisms
for classifier systems are discussed in refs. [19, 29, 30, 34].

Sutton specified a difference between the TD methods and the bucket brigade [77]; 
the bucket brigade assigns credit based on rules that activate other rules, whereas 
the TD methods assign credit based solely on temporal succession. The bucket 
brigade thus combines both temporal and structural credit assignment in a single 
rather arbitrary mechanism, although the mechanism may not necessarily correctly 
solve optimization problems. 

10.10.2 Genetic Reinforcers* 

Several researchers employed a genetic algorithm (GA) in reinforcement learning. 
Odetayo and McGregor [61] applied GA-based reinforcement learning to the pole- 
balancing control problem discussed in Section 10.5.1. As with the Boxes system 
approach, they discretized the state space into 54 regions; each region contained a 
production rule to specify an action (push left or right). A chromosome consisted 
of 54 rules, each of which represented either “1” (push left) or “0” (push right). The
chromosomes were rated for effectiveness in balancing the pole and were subjected
to the usual genetic operations: reproduction, crossover, and mutation. It was 
reported that the GA-based reinforcement learning required fewer iterations than 
the Boxes, AHC, and CART algorithms. 

Whitley et al. [93] investigated the same control task. They used genetic hill 
climbing to train an NN that produced action probabilities, ranging from 0.0 (push 
left) to 1.0 (push right). (For general treatment of combinations of GAs and NNs, 
good surveys are provided in refs. [68] and [98].) Fitness was determined by the 
amount of time the pole stayed balanced. Whitley et al. reported that the genetic 
reinforcement learning produced results comparable to AHC. 

Ackley and Littman [2] introduced evolutionary reinforcement learning 
(ERL), which combines genetic evolution with NN learning and an artificial life 
“ecosystem.” It is based on a hypothesis that evolution and learning progress syn- 
ergistically. More specifically, initial weights of the action and value NNs are speci- 
fied genetically; the weights of the value NN are evolved by a GA, and the value NN 
outputs are then used to train an action NN by what Ackley and Littman called 
complementary reinforcement backpropagation (CRBP) [1, 2]. CRBP, based on a 
simple TD algorithm, embodies a heuristic rule that the desired output on negative 
reinforcement is the complement of the output generated by the action NN. Their 
ERL can be viewed as an implementation of a genetic AHC concept. 

Unemi et al. [83] applied genetic Q-learning to Dyna’s navigation task [73]; Q- 
values as well as other parameters (learning rate and discounting rate) were encoded 
as binary strings. 

10.10.3 Immune Modeling* 

Immune models are inspired by biological immune system mechanisms, in which an- 
tibodies fight against antigens (intruders). Varela et al. have proposed immune net- 
works (INs) [84] that act mainly on their own rather than in the presence of antigens. 
(Perelson and Forrest discussed another genetic immune system in ref. [62].) Bersini 
and Varela formulated an immune recruitment mechanism in conjunction with GAs, 
calling the result “GIRM: Genetic Immune Recruitment Mechanism” [14, 15]. They 
also incorporated transitional proximity Q-learning [20] into the recruitment mech- 
anism. 


10.11 SUMMARY 

This chapter has covered a broad range of reinforcement learning techniques, pre- 
senting their fundamental concepts. Departing from the jackpot journey, we have 
pointed out a diversity of available structures and reviewed their learning opera- 
tions, including several evolutionary computational models. 

As we have discussed, the dynamic programming (DP) notion basically encap- 
sulates the characteristics of reinforcement learning techniques; DP is central to 
many techniques of reinforcement learning. In particular, emphasis is being placed
on features of totally model-free learning without recording past trials explicitly,
as we have seen in Q-learning. In behavioral ecology, it has been suggested that
DP-based computations may be performed by animals for calculating their optimal 
behavioral policies [43, 52]. Interestingly enough, Montague et al. have reinforced 
this suggestion; they have constructed an NN model to simulate the behavior of a 
foraging honeybee in accordance with reinforcement learning [58]. 

In Section 10.7, we have provided a simulation example for testing TD-learning, 
an AHC concept, and Q-learning on the basis of NN function approximators. When 
the networks were designed to possess hidden layers to draw generalization abilities, 
it was difficult to get them to work even for such a small three-stage minimum- 
cost path problem. Network configurations and input state representations must 
be important for better realization. Also, some approximator other than neural 
networks may be employed for improving overall performance. 

Many of the current reinforcement learning techniques are still in the research 
stage, yielding merely satisfactory results for some problems in narrow areas. But 
the paradigms are offering great potential. Hence, ongoing explorations may lead 
them to higher performance techniques and would provide incredibly important 
mechanisms in machine learning because our learning agents know the axiom, “Fail- 
ure is the surest path to success!” 


EXERCISES 

1. Looking at Figure 10.2, specify the reason why the converged $P_{\mathrm{down}}$ is 0.5 at
vertices B and F.

2. Develop a computer program to simulate the signpost model to solve the jack- 
pot journey problem illustrated in Figure 10.1. 

3. Derive Equation (10.5) from Equation (10.7). 

4. During the learning process of AHC-nets in Figure 10.7, when the action NN 
learns extremely slowly compared with the value NN, what do the current 
values (produced by the value NN) show? 

5. Discuss with your classmates how to apply reinforcement learning to the well-known
tic-tac-toe game.


Chapter 11


Unsupervised Learning and
Other Neural Networks


J.-S. R. Jang and E. Mizutani 

11.1 INTRODUCTION 

In this chapter, we discuss various neural systems that are frequently categorized 1 
outside the class of pure supervised learning neural networks (NNs) discussed in 
Chapter 9. In particular, we concentrate on neural networks with two learning 
modes: unsupervised learning and recording learning. 

When no external teacher or critic’s instruction is available, only input vectors 
can be used for learning. Such an approach is learning without supervision, or 
what is commonly referred to as unsupervised learning. An unsupervised learning 
system (or agent) evolves to extract features or regularities in presented patterns, 
without being told what outputs or classes associated with the input patterns are 
desired. In other words, the learning system detects or categorizes persistent fea- 
tures without any feedback from the environment. Thus, unsupervised learning is 
frequently employed for data clustering, feature extraction, and similarity detection. 

Unsupervised learning NNs attempt to learn to respond to different input pat- 
terns with different parts of the network. The network is often trained to strengthen 
firing to respond to frequently occurring patterns, thereby leading to the so-called 
synonym probability estimators. In this manner, the network develops certain inter- 
nal representations for encoding input patterns. In this chapter, for unsupervised 
learning paradigms, we describe competitive learning, the Kohonen self-organizing 
feature map, and principal component analysis. 

Another mode of learning, called recording learning by Zurada [49], is typically
employed for associative memory networks. Usually we design an associative
memory network by recording several ideal patterns into the network’s stable states,
and we expect the network state to reach one of those patterns when given a pattern
(perhaps a contaminated one) as the network’s initial state. Stochastic optimization
techniques (e.g., simulated annealing) are frequently employed for altering the
state transition process of the network. In this chapter, we describe the Hopfield
network as an example of recording learning systems. The network is also referred
to as content addressable or auto-associative, capable of rectifying and recovering
contaminated or incomplete input patterns.

1 Previous literature proposes numerous ways of categorizing neural network paradigms. Our
NN classification originates from Barto’s taxonomy on page 222 in ref. [36], and Zurada’s
classification on page 75 in ref. [49], although the two are not completely identical. See also
Table 9.1 in Chapter 9.

Figure 11.1. Competitive learning network.

Finally, neural network learning procedures are summarized in light of a general 
formula proposed by Amari [3].


11.2 COMPETITIVE LEARNING NETWORKS 

With no available information regarding the desired outputs, unsupervised learning 
networks update weights only on the basis of the input patterns. The competitive 
learning network is a popular scheme to achieve this type of unsupervised data 
clustering or classification; Figure 11.1 presents an example. All input units i are
connected to all output units j with weight $w_{ij}$. The number of inputs is the
input dimension, while the number of outputs is equal to the number of clusters 
that the data are to be divided into. A cluster center’s position is specified by the 
weight vector connected to the corresponding output unit. For the simple network 
in Figure 11.1, the three-dimensional input data are divided into four clusters, and 
the cluster centers, denoted as the weights, are updated via the competitive learning 
rule. 

The input vector $\mathbf{x} = [x_1, x_2, x_3]^T$ and the weight vector $\mathbf{w}_j = [w_{1j}, w_{2j}, w_{3j}]^T$
for an output unit j are generally assumed to be normalized to unit length. The
activation value $a_j$ of output unit j is then calculated by the inner product of the
input and weight vectors:

$$a_j = \sum_{i=1}^{3} x_i w_{ij} = \mathbf{x}^T \mathbf{w}_j = \mathbf{w}_j^T \mathbf{x}. \tag{11.1}$$

Figure 11.2. Competitive learning with unit-length vectors. The dots represent the
input vectors and the crosses denote the weight vectors for the four output units in
Figure 11.1. As the learning continues, the four weight vectors rotate toward the
centers of the four input clusters. (MATLAB command: compball)


Next, the output unit with the highest activation must be selected for further pro- 
cessing, which is what is implied by competitive. Assuming that output unit k has 
the maximal activation, the weights leading to this unit are updated according to 
the competitive or the so-called winner-take-all learning rule:

$$\mathbf{w}_k(t+1) = \frac{\mathbf{w}_k(t) + \eta(\mathbf{x}(t) - \mathbf{w}_k(t))}{\|\mathbf{w}_k(t) + \eta(\mathbf{x}(t) - \mathbf{w}_k(t))\|}. \tag{11.2}$$


The preceding weight update formula includes a normalization operation to ensure 
that the updated weight is always of unit length. Notably, only the weights at the 
winner output unit k are updated; all other weights remain unchanged. 

The update formula in Equation (11.2) implements a sequential scheme for find- 
ing the cluster centers of a data set of which the entries are of unit length. When 
an input x is presented to the network, the weight vector closest to x rotates to- 
ward it. Consequently, weight vectors move toward those areas where most inputs 
appear and, eventually, the weight vectors become the cluster centers for the data 
set. Figure 11.2 illustrates this dynamic process. 
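
The inner-product activation of Equation (11.1) and the normalized winner-take-all update of Equation (11.2) can be sketched in a few lines. This is only a minimal illustration, not the book's `compball` demo: the function names `normalize` and `competitive_step`, the two-unit network, and the value of eta are assumptions chosen for the example.

```python
import math

def normalize(v):
    """Scale vector v to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def competitive_step(weights, x, eta):
    """Apply one winner-take-all update in place and return the winner's index.

    weights: list of unit-length weight vectors, one per output unit.
    x: unit-length input vector; eta: learning rate.
    """
    # Equation (11.1): activation is the inner product of input and weight.
    activations = [sum(xi * wi for xi, wi in zip(x, w)) for w in weights]
    k = max(range(len(weights)), key=lambda j: activations[j])
    # Equation (11.2): move the winner toward x, then renormalize.
    moved = [wi + eta * (xi - wi) for xi, wi in zip(x, weights[k])]
    weights[k] = normalize(moved)
    return k

# Usage: two output units in three dimensions (values chosen for illustration).
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = normalize([2.0, 1.0, 0.0])
winner = competitive_step(weights, x, eta=0.5)
```

Only the winner's weight vector changes, and it stays on the unit sphere while rotating toward the input, which is the behavior described for Figure 11.2.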

Using the Euclidean distance as a dissimilarity measure is a more general
scheme of competitive learning, in which the activation of output unit j is

$$a_j = \left( \sum_{i=1}^{3} (x_i - w_{ij})^2 \right)^{0.5} = \|\mathbf{x} - \mathbf{w}_j\|. \tag{11.3}$$

The weights of the output unit with the smallest activation are updated according
to

$$\mathbf{w}_k(t+1) = \mathbf{w}_k(t) + \eta(\mathbf{x}(t) - \mathbf{w}_k(t)). \tag{11.4}$$

In the preceding equation, the winning unit’s weights shift toward the input x. In
this case, neither the data nor the weights need to be of unit length.
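
As a rough sketch of the Euclidean variant in Equations (11.3) and (11.4), the following applies the update sequentially to a toy data set. The function name `euclidean_competitive_step`, the eight data points, the initial weights, and the constant eta of 0.1 are all illustrative assumptions.

```python
import math

def euclidean_competitive_step(weights, x, eta):
    """Update the closest weight vector in place and return its index."""
    # Equation (11.3): activation is the Euclidean distance to the input.
    activations = [math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))
                   for w in weights]
    k = min(range(len(weights)), key=lambda j: activations[j])
    # Equation (11.4): shift the winner toward x; no normalization is applied.
    weights[k] = [wi + eta * (xi - wi) for xi, wi in zip(x, weights[k])]
    return k

# Usage: two well-separated groups of 2-D points (illustrative data).
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
weights = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(50):
    for x in data:
        euclidean_competitive_step(weights, x, eta=0.1)
```

With a constant learning rate the two weight vectors settle near, but keep hovering around, the two group means.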

A competitive learning network performs an on-line clustering process on the 
input patterns. When the process is complete, the input data are divided into 
disjoint clusters such that similarities between individuals in the same cluster are 
larger than those in different clusters. Here two metrics of similarity are introduced: 
the similarity measure of inner product in Equation (11.1) and the dissimilarity 
measure of the Euclidean distance in Equation (11.3). Obviously, other metrics can 
be used instead, and different selections lead to different clustering results. When 
the Euclidean distance is adopted, it can be proved that the update formula in 
Equation (11.4) is actually an on-line version of gradient descent that minimizes 
the following objective function:

$$E = \sum_p \|\mathbf{w}_{f(\mathbf{x}_p)} - \mathbf{x}_p\|^2, \tag{11.5}$$

where $f(\mathbf{x}_p)$ is the winning neuron when input $\mathbf{x}_p$ is presented and $\mathbf{w}_{f(\mathbf{x}_p)}$ is the
center of the class to which $\mathbf{x}_p$ belongs. This fact is left as Exercise 1 at the end of
this chapter.

A large family of batch-mode (or off-line) clustering algorithms can be used 
to find cluster centers that minimize Equation (11.5). One of these algorithms is 
K-means clustering, as explained in detail in Chapter 15. 

A limitation of competitive learning is that some of the weight vectors that are
initialized to random values may be far from any input vector and, subsequently,
never get updated. Such a situation can be prevented by initializing the weights
to samples from the input data itself, thereby ensuring that all of the weights get
updated when all the input patterns are presented. An alternative would be to
update the weights of both the winning and losing units, but use a significantly
smaller learning rate $\eta$ for the losers; this is commonly referred to as leaky learning
[45]. Other methods that prevent weights from not getting updated can be
found in ref. [22].
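
Both remedies are easy to sketch. The helper names `init_from_data` and `leaky_step`, the two learning rates, and the use of the Euclidean winner are assumptions made for illustration; ref. [45] may define leaky learning with different details.

```python
import random

def init_from_data(data, n_units, rng):
    """Initialize the weights as copies of distinct samples from the data set."""
    return [list(x) for x in rng.sample(data, n_units)]

def leaky_step(weights, x, eta_win, eta_lose):
    """Move the winner with eta_win and every loser with the smaller eta_lose."""
    dist2 = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    k = min(range(len(weights)), key=lambda j: dist2[j])
    for j, w in enumerate(weights):
        eta = eta_win if j == k else eta_lose
        weights[j] = [wi + eta * (xi - wi) for xi, wi in zip(x, w)]
    return k

# Usage: a unit that starts far from the data still drifts toward it.
far_weights = [[0.0], [100.0]]
winner = leaky_step(far_weights, [1.0], eta_win=0.5, eta_lose=0.01)
```

Because the losers also move slightly, a badly placed unit eventually comes close enough to the data to start winning.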

Dynamically changing the learning rate $\eta$ in the weight update formula of Equation
(11.2) or (11.4) is generally desired. An initial large value of $\eta$ explores the
data space widely; later on, a progressively smaller value refines the weights. The
operation is similar to the cooling schedule of simulated annealing, as introduced in
Chapter 7. Therefore, one of the following formulas for $\eta$ is commonly used:

$$\left\{ \begin{array}{l}
\eta(t) = \eta_0 e^{-\alpha t}, \text{ with } \alpha > 0, \text{ or} \\
\eta(t) = \eta_0 t^{-\alpha}, \text{ with } \alpha < 1, \text{ or} \\
\eta(t) = \eta_0 (1 - \alpha t), \text{ with } 0 < \alpha < (\max\{t\})^{-1}.
\end{array} \right.$$
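
The three schedules translate directly into code. The function names below are hypothetical, and the sample values of eta0 and alpha in the usage lines are illustrative only.

```python
import math

def eta_exponential(t, eta0, alpha):
    """eta(t) = eta0 * exp(-alpha * t), with alpha > 0."""
    return eta0 * math.exp(-alpha * t)

def eta_power(t, eta0, alpha):
    """eta(t) = eta0 * t**(-alpha); t must start at 1 to avoid division by zero."""
    return eta0 * t ** (-alpha)

def eta_linear(t, eta0, alpha):
    """eta(t) = eta0 * (1 - alpha * t); stays positive while alpha < 1 / max(t)."""
    return eta0 * (1.0 - alpha * t)

# Usage: learning rates over 100 iterations (parameter values are illustrative).
exp_rates = [eta_exponential(t, 0.5, 0.05) for t in range(1, 101)]
pow_rates = [eta_power(t, 0.5, 0.5) for t in range(1, 101)]
lin_rates = [eta_linear(t, 0.5, 0.005) for t in range(1, 101)]
```

For the linear schedule, alpha must be chosen from the planned number of iterations; here 0.005 is below 1/100, so the rate remains positive throughout.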

Competitive learning lacks the capability to add new clusters when deemed necessary.
Moreover, if the learning rate $\eta$ is a constant, competitive learning does
not guarantee stability in forming clusters; the winning unit that responds to a
particular pattern may continue changing during training. On the other hand,
$\eta$, if decreasing with time, may become too small to update cluster centers when
new data of a different probability nature are presented. Carpenter and Grossberg
referred to such an occurrence as the stability-plasticity dilemma, which is common
in designing intelligent learning systems [8]. In general, a learning agent (or
system) should be plastic, or adaptive in reacting to changing environments; meanwhile,
it should be stable to preserve knowledge acquired previously. Adaptive
resonance theory (ART), as introduced by Grossberg [17], proposes a solution
to this dilemma. Based on ART, Carpenter and Grossberg proposed a series of
similar networks, including ART1, ART2 [7], ART3 [9], and ARTMAP [10].

If the output units of a competitive learning network are arranged in a geometric 
manner (such as in a one-dimensional vector or two-dimensional array), then we can
update the weights of the winners as well as the neighboring losers. Such a capability 
corresponds to the notion of Kohonen feature maps, as discussed in the next section. 

After competitive learning is finished, the input space is divided into a number 
of disjoint clusters, each of which is represented by a cluster center. These cluster
centers are also known as templates, reference vectors, or codebook vectors [16, 38]. For
an input vector, we can use the corresponding template to represent the input vector 
rather than the vector itself. Such an approach is called vector quantization 
and it has been used for data compression in image processing and communication 
systems. Section 11.4 introduces a supervised version of vector quantization, known 
as learning vector quantization [33, 34, 35]. Other competitive learning applications 
include graph bipartitioning [22, 45] and word perception models [45]. 
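
Vector quantization as described above amounts to replacing each input by the index of its nearest codebook vector. The sketch below assumes a Euclidean nearest-neighbor rule; the names `quantize`, `encode`, and `decode` and the tiny codebook are hypothetical.

```python
def quantize(x, codebook):
    """Return the index of the codebook vector nearest to x."""
    return min(range(len(codebook)),
               key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, codebook[j])))

def encode(data, codebook):
    """Compress each input vector to a single codebook index."""
    return [quantize(x, codebook) for x in data]

def decode(indices, codebook):
    """Reconstruct each input by the template its index points to."""
    return [codebook[j] for j in indices]

# Usage: a two-entry codebook, as might result from competitive learning.
codebook = [[0.0, 0.0], [1.0, 1.0]]
data = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.4]]
indices = encode(data, codebook)
restored = decode(indices, codebook)
```

Only the indices (and the shared codebook) need to be stored or transmitted, which is the source of the compression.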

11.3 KOHONEN SELF-ORGANIZING NETWORKS 

Kohonen self-organizing networks [30, 31], also known as Kohonen feature 
maps or topology-preserving maps, are another competition-based network 
paradigm for data clustering. Networks of this type impose a neighborhood con- 
straint on the output units, such that a certain topological property in the input 
data is reflected in the output units’ weights. 

Figure 11.3. (a) A Kohonen self-organizing network with 2 input and 49 output
units; (b) the size of a neighborhood around a winning unit decreases gradually with
each iteration.

Figure 11.3(a) presents a relatively simple Kohonen self-organizing network with
2 inputs and 49 outputs. The learning procedure of Kohonen feature maps is similar
to that of competitive learning networks. That is, a similarity (dissimilarity)
measure is selected and the winning unit is considered to be the one with the largest
(smallest) activation. For Kohonen feature maps, however, we update not only the
winning unit’s weights but also all of the weights in a neighborhood around the winning
unit. The neighborhood’s size generally decreases slowly with each iteration,
as indicated in Figure 11.3(b). A sequential description of how to train a Kohonen
self-organizing network is as follows:

Step 1: Select the winning output unit as the one with the largest similarity measure
(or smallest dissimilarity measure) between all weight vectors $\mathbf{w}_i$ and the input
vector $\mathbf{x}$. If the Euclidean distance is chosen as the dissimilarity measure, then
the winning unit c satisfies the following equation:

$$\|\mathbf{x} - \mathbf{w}_c\| = \min_i \|\mathbf{x} - \mathbf{w}_i\|,$$

where the index c refers to the winning unit.

Step 2: Let $NB_c$ denote a set of indices corresponding to a neighborhood around
winner c. The weights of the winner and its neighboring units are then updated by

$$\Delta \mathbf{w}_i = \eta(\mathbf{x} - \mathbf{w}_i), \quad i \in NB_c,$$

where $\eta$ is a small positive learning rate. Instead of defining the neighborhood
of a winning unit, we can use a neighborhood function $\Omega_c(i)$ around
a winning unit c. For instance, the Gaussian function can be used as the
neighborhood function:

$$\Omega_c(i) = \exp\left( \frac{-\|p_i - p_c\|^2}{2\sigma^2} \right),$$

where $p_i$ and $p_c$ are the positions of the output units i and c, respectively, and
$\sigma$ reflects the scope of the neighborhood. By using the neighborhood function,
the update formula can be rewritten as

$$\Delta \mathbf{w}_i = \eta\, \Omega_c(i) (\mathbf{x} - \mathbf{w}_i),$$

where i is the index for all output units.

Figure 11.4. Simulation of the Kohonen self-organizing network: (a) input data
uniformly distributed within [0, 1] × [0, 1]; (b) initial weights; (c) weights after 30
iterations; (d) weights after 1000 iterations. (MATLAB command: kfm(1))

Figure 11.5. Simulation of the Kohonen self-organizing network: (a) input data
uniformly distributed within a triangular region; (b) initial weights; (c) weights after
30 iterations; (d) weights after 1000 iterations. (MATLAB command: kfm(2))


To achieve a better convergence, the learning rate $\eta$ and the size of neighborhood
(or $\sigma$) should be decreased gradually with each iteration. Figures 11.4, 11.5, and
11.6 present simulation results of Kohonen feature maps with different input data
distributions; the output units are arranged in a 10-by-10 two-dimensional mesh.
In the simulation, $\eta$ and $\sigma$ linearly decreased with the number of iterations.
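
The two training steps, together with linearly decreasing learning rate and neighborhood width, can be sketched as follows. This is not the book's `kfm` program: the function name `train_som`, the random initialization in the unit square, the Gaussian neighborhood with a 2σ² denominator, and the floor of 0.5 on the width are assumptions made for illustration.

```python
import math
import random

def train_som(data, rows, cols, n_iter, eta0, sigma0, rng):
    """Train a rows-by-cols Kohonen map; returns {grid position: weight vector}."""
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    dim = len(data[0])
    weights = {p: [rng.random() for _ in range(dim)] for p in positions}
    for t in range(n_iter):
        # Linearly decreasing learning rate and neighborhood width.
        frac = 1.0 - t / n_iter
        eta = eta0 * frac
        sigma = max(sigma0 * frac, 0.5)  # assumed floor keeps sigma positive
        x = rng.choice(data)
        # Step 1: the winner has the smallest squared Euclidean distance to x.
        win = min(positions,
                  key=lambda p: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[p])))
        # Step 2: update every unit, weighted by a Gaussian neighborhood
        # function of its grid distance to the winner.
        for p in positions:
            d2 = (p[0] - win[0]) ** 2 + (p[1] - win[1]) ** 2
            omega = math.exp(-d2 / (2.0 * sigma ** 2))
            weights[p] = [wi + eta * omega * (xi - wi)
                          for xi, wi in zip(x, weights[p])]
    return weights
```

Calling `train_som(data, 10, 10, 1000, 0.5, 3.0, random.Random(0))` on points drawn uniformly from the unit square would correspond loosely to the setting of Figure 11.4, although the parameter values here are guesses rather than those used in the book.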

The most well-known application of Kohonen self-organizing networks is Ko- 
honen’s attempt to construct a neural phonetic typewriter [32] that is capable of 
transcribing speech into written text from an unlimited vocabulary, with an ac- 
curacy of 92% to 97%. The network has also been used to learn ballistic arm 
movements [44].

Figure 11.6. Simulation of the Kohonen self-organizing network: (a) input
data uniformly distributed within a doughnut-shaped region; (b) initial weights; (c)
weights after 30 iterations; (d) weights after 1000 iterations. (MATLAB command:
kfm(5))


11.4 LEARNING VECTOR QUANTIZATION 

Learning vector quantization (LVQ) [33, 34, 35] is an adaptive data classifi- 
cation method based on training data with desired class information. Although a 
supervised training method, LVQ employs unsupervised data-clustering techniques 
(e.g., competitive learning, introduced in Section 11.2) to preprocess the data set 
and obtain cluster centers. 

LVQ’s network architecture closely resembles that of a competitive learning net- 
work, except that each output unit is associated with a class. Figure 11.7(a) presents 
an example, where the input dimension is 2 and the input space is divided into six 
clusters. The first two clusters belong to class 1, while the other four clusters belong 
to class 2. The LVQ learning algorithm involves two steps. In the first step, an un- 
supervised learning data clustering method is used to locate several cluster centers 
without using the class information. In the second step, the class information is 
used to fine-tune the cluster centers to minimize the number of misclassified cases. 

During the first step of unsupervised learning, any of the data clustering tech- 
niques introduced in this chapter and Chapter 15 can be used to identify cluster 
centers (or weight vectors leading to output units) to represent the data set with 
no class information. The number of clusters can either be specified a priori or
determined via a clustering technique capable of adaptively adding new clusters when
necessary. Once the clusters are obtained, their classes must be labeled before moving
to the second step of supervised learning. Such labeling is achieved by the
so-called voting method (i.e., a cluster is labeled class k if the majority of the data
points within the cluster belong to class k). The clustering process for LVQ
is based on the general assumption that similar input patterns generally belong to 
the same class. 

During the second step of supervised learning, the cluster centers are fine-tuned
to approximate the desired decision hypersurface. The learning method is straightforward.
First, the weight vector (or cluster center) w that is closest to the input
vector x must be found. If x and w belong to the same class, we move w toward
x; otherwise we move w away from the input vector x.

Figure 11.7. Learning vector quantization (LVQ): (a) network representation; (b)
possible data distribution and decision boundary. [MATLAB command for (b):
lvqdata]

After learning, an LVQ network classifies an input vector by assigning it to the 
same class as the output unit that has the weight vector (cluster center) closest to 
the input vector. Figure 11.7(b) illustrates a possible distribution of data set and 
weights after training. 

A sequential description of the LVQ method is as follows: 

Step 1: Initialize the cluster centers by a clustering method. 

Step 2: Label each cluster by the voting method. 

Step 3: Randomly select a training input vector x and find k such that $\|\mathbf{x} - \mathbf{w}_k\|$
is a minimum.

Step 4: If x and $\mathbf{w}_k$ belong to the same class, update $\mathbf{w}_k$ by

$$\Delta \mathbf{w}_k = \eta(\mathbf{x} - \mathbf{w}_k).$$

Otherwise, update $\mathbf{w}_k$ by

$$\Delta \mathbf{w}_k = -\eta(\mathbf{x} - \mathbf{w}_k).$$

The learning rate $\eta$ is a small positive constant and should decrease with each
iteration.

Step 5: If the maximum number of iterations is reached, stop. Otherwise, return
to step 3.

Figure 11.8. A simple network topology for Hebbian learning; weight $w_{ij}$ resides
between two neurons i and j.

Two improved versions of LVQ are available; both of them attempt to use the
training data more efficiently by updating the winner and the runner-up (the next
closest vector) under a certain condition. The improved versions are called LVQ2
and LVQ3; refs. [34] and [35] provide further details.
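
Steps 2 through 5 of the basic LVQ method can be sketched as below, assuming the initial cluster centers of Step 1 come from any clustering method. The function names, the linear decay of the learning rate, and the use of squared Euclidean distance are illustrative assumptions; LVQ2 and LVQ3 are not covered.

```python
from collections import Counter
import random

def nearest(x, centers):
    """Index of the center closest to x in squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))

def vote_labels(data, labels, centers):
    """Step 2: label each cluster by the majority class of its members."""
    votes = [Counter() for _ in centers]
    for x, y in zip(data, labels):
        votes[nearest(x, centers)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def lvq_step(x, y, centers, center_labels, eta):
    """Steps 3 and 4 for one sample: attract or repel the closest center."""
    k = nearest(x, centers)
    sign = 1.0 if center_labels[k] == y else -1.0
    centers[k] = [w + sign * eta * (a - w) for a, w in zip(x, centers[k])]
    return k

def lvq_train(data, labels, centers, n_iter, eta0, rng):
    """Run Steps 2 to 5 starting from centers supplied by Step 1."""
    centers = [list(c) for c in centers]
    center_labels = vote_labels(data, labels, centers)
    for t in range(n_iter):
        eta = eta0 * (1.0 - t / n_iter)  # assumed linearly decreasing rate
        i = rng.randrange(len(data))     # Step 3: random training sample
        lvq_step(data[i], labels[i], centers, center_labels, eta)
    return centers, center_labels
```

After training, an input x would be classified as `center_labels[nearest(x, centers)]`, matching the nearest-center rule described earlier in this section.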

11.5 HEBBIAN LEARNING 

Hebb [20] described a simple learning method of synaptic weight change. When two 
cells fire simultaneously (i.e., have strong responses), their connection strength (or 
weight) increases. This phenomenon is the so-called Hebbian learning, where the
weight increase between two neurons is proportional to the frequency at which they
fire together. Among various mathematical formulas of this principle, the simplest
one is expressed as

$$\Delta w_{ij} = \eta y_i y_j, \tag{11.6}$$

where $\eta$ is the learning rate. Since weights are adjusted according to the correlation
of neuron outputs, the preceding formula is a type of correlational learning rule.

The ith neuron’s output $y_i$ can be regarded as an input ($x_i$) to another neuron j
(Figure 11.8), so Equation (11.6) can be written as

$$\Delta w_{ij} = \eta y_j x_i. \tag{11.7}$$

Restated, a weight is assumed to change proportionately to the correlation of the
input and output signals. By using a neuron function $f(\cdot)$, $y_j$ is given by

$$y_j = f(\mathbf{w}_j^T \mathbf{x}).$$

Thus, Equation (11.7) is equivalent to the following:

$$\Delta w_{ij} = \eta f(\mathbf{w}_j^T \mathbf{x})\, x_i. \tag{11.8}$$

Figure 11.9. One-layer single-output network with Hebbian learning for principal
component analysis.

A sequence of learning patterns indexed by p is assumed here to be presented 
to the network; in addition, all initial weights are zero. By using Equation (11.6), 
the update amount of a weight after the entire data set is presented will be 

$$w_{ij} = \eta \sum_p y_{ip} y_{jp}. \tag{11.9}$$

Frequent input patterns have the most impact on the weights and, eventually, cause 
the network to produce the largest outputs. Applying the plain Hebbian learning in 
Equation (11.6) causes unconstrained growth of the weights. Hence, in some cases, 
the Hebbian rule is modified to counteract the unlimited growth of weights. Weight 
normalization, as described in subsection 11.6.2, is one such method. 

The rationale behind the Hebbian learning rule is most easily understood via a 
single-layer n-input one-output neural network with identity activation functions, 
as shown in Figure 11.9. The output y is equal to $\sum_{i=1}^{n} w_i x_i$, or in matrix form,

$$y = \mathbf{w}^T \mathbf{x} = \mathbf{x}^T \mathbf{w},$$

where $\mathbf{x} = [x_1, \ldots, x_n]^T$ is the input vector and $\mathbf{w} = [w_1, \ldots, w_n]^T$ is the weight
vector. The corresponding Hebbian learning rule is

$$\Delta \mathbf{w} = \eta y \mathbf{x}. \tag{11.10}$$

The preceding learning rule is an on-line steepest-descent scheme that minimizes
the objective function

$$J = -\frac{1}{2} \sum_p y_p^2 = -\frac{1}{2} \sum_p (\mathbf{x}_p^T \mathbf{w})^2, \tag{11.11}$$

where $\mathbf{x}_p$ and $y_p$ are the pth input and corresponding desired output, respectively.
If w is a unit vector, then $\mathbf{x}_p^T \mathbf{w}$ is the projection of the vector $\mathbf{x}_p$ onto the direction
specified by a vector w. The minimization of J implies finding a unit-length w that
most accurately represents the direction of the entire data set.
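
A small experiment makes both properties of the plain Hebbian rule in Equation (11.10) concrete: the weight norm grows without bound, while the weight direction turns toward the dominant direction of the data. The function names, the four-point zero-mean data set, the initial weights, and the learning rate are assumptions chosen so the effect is easy to see.

```python
import math

def hebbian_step(w, x, eta):
    """Return w + eta * y * x with y = w^T x (Equation 11.10)."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * y * xi for wi, xi in zip(w, x)]

def norm(v):
    """Euclidean length of v."""
    return math.sqrt(sum(c * c for c in v))

# Zero-mean illustrative data with most of its variance along the first axis.
data = [[2.0, 0.0], [-2.0, 0.0], [0.0, 0.5], [0.0, -0.5]]
w = [0.1, 0.1]
norms = [norm(w)]
for epoch in range(50):
    for x in data:
        w = hebbian_step(w, x, eta=0.1)
    norms.append(norm(w))
direction = [c / norm(w) for c in w]
```

The recorded norms increase every epoch, which is the unconstrained growth that weight normalization is meant to counteract, while `direction` ends up almost parallel to the first axis.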

Other learning algorithms (such as in the Hopfield network learning and super- 
vised correlation learning) often reflect the Hebbian learning principle [39]; they 
share the correlational property in their formulas. 

11.6 PRINCIPAL COMPONENT NETWORKS 

This section describes a single-layer single-output network with Hebbian-type learn- 
ing that can be used for extracting principal components (eigenvectors correspond- 
ing to the largest eigenvalues) of the correlation matrix of the training vectors. 

11.6.1 Principal Component Analysis 

An important issue in pattern recognition (or data classification) is to select features 
(or inputs of the training vectors) that have more discriminant power and use the 
selected features as inputs to a recognition or classification scheme. This feature 
selection process is essential for a real-world data set, in which the number of 
features generally exceeds 10. Hence we need to determine their priorities and feed 
the important features to our classification system for effective training. 

Similar input patterns likely belong to the same class. Under this observation, 
the input variables can be normalized to within the unit interval and then selected 
according to their variances. That is, the larger the variances, the more likely that
the input variables have better discriminant powers.

For some data sets, combining two features might yield a better discriminant 
power than either one alone. Therefore, the data set must occasionally be trans- 
formed into a more recognizable or trainable form. Principal component anal- 
ysis (PCA) [41] is one approach to combining inputs linearly and identifying their 
priorities; it is also known as the Karhunen-Loève transformation [41] in
communication theory and image processing.

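Let $\mathbf{x}_i$, $i = 1, \ldots, n$, be the $i$th entry of the data set under consideration. A
unit vector $\mathbf{u}$ is to be found such that the variance of the data set, after projecting
onto $\mathbf{u}$, is maximized. Without loss of generality, the data set is assumed to be zero
mean:

\[ \sum_{i=1}^{n} \mathbf{x}_i = \mathbf{0}. \]

The projection of $\mathbf{x}_i$ onto $\mathbf{u}$ is defined by the inner product:

\[ p_i = \mathbf{x}_i \cdot \mathbf{u} = \mathbf{x}_i^T \mathbf{u} = \mathbf{u}^T \mathbf{x}_i, \]

subject to the constraint that $\mathbf{u}$ is a unit vector:

\[ \mathbf{u}^T \mathbf{u} = 1. \]

Since $\mathbf{x}_i$ is zero mean, so is $p_i$:

\[ \sum_{i=1}^{n} p_i = \sum_{i=1}^{n} \mathbf{u}^T \mathbf{x}_i = \mathbf{u}^T \sum_{i=1}^{n} \mathbf{x}_i = \mathbf{u}^T \cdot \mathbf{0} = 0. \]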

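The square of $p_i$ can be expressed as

\[ p_i^2 = (\mathbf{u}^T \mathbf{x}_i)(\mathbf{x}_i^T \mathbf{u}) = \mathbf{u}^T (\mathbf{x}_i \mathbf{x}_i^T) \mathbf{u}. \]

Therefore, the variance of the projection $p_i$ is

\[ \sigma_p^2(\mathbf{u}) = \frac{1}{n} \sum_{i=1}^{n} p_i^2
   = \mathbf{u}^T \left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T \right) \mathbf{u}
   = \mathbf{u}^T \mathbf{R} \mathbf{u}, \]

where the symmetric matrix $\mathbf{R}$ is the data set's correlation matrix. [If the data
set is not zero mean, then $\mathbf{R}$ is called the covariance matrix and is defined as
$\frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$, where $\boldsymbol{\mu}$ is the mean of $\mathbf{x}_i$, $i = 1, \ldots, n$.]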

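The projection variance $\sigma_p^2(\mathbf{u})$ subject to the unit-vector constraint can be
maximized by defining a new objective function using the Lagrange multiplier $\lambda$:

\[ J = \mathbf{u}^T \mathbf{R} \mathbf{u} + \lambda (1 - \mathbf{u}^T \mathbf{u}). \]

Differentiating the preceding equation with respect to $\mathbf{u}$ and setting the result to zero yields

\[ \nabla_{\mathbf{u}} J = 2 \mathbf{R} \mathbf{u} - 2 \lambda \mathbf{u} = \mathbf{0}, \]

or

\[ \mathbf{R} \mathbf{u} = \lambda \mathbf{u}. \]

This stationary condition implies that $\lambda$ is an eigenvalue of the correlation matrix
$\mathbf{R}$ and $\mathbf{u}$ is the corresponding eigenvector. At the preceding stationary condition, the
projection variance is

\[ \sigma_p^2(\mathbf{u}) = \mathbf{u}^T \mathbf{R} \mathbf{u} = \mathbf{u}^T \lambda \mathbf{u} = \lambda \mathbf{u}^T \mathbf{u} = \lambda. \]

Therefore, the projection variance $\sigma_p^2(\mathbf{u})$ has a maximum equal to the largest
eigenvalue of the correlation matrix $\mathbf{R}$; this occurs when the projection vector $\mathbf{u}$ is equal
to the corresponding eigenvector.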
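
The result above can be checked numerically. The following is a minimal sketch, not the authors' program: the five 2-D data points and all helper names are made up for illustration, and power iteration is used here only as a convenient way to find the dominant eigenvector. It verifies that the projection variance along that eigenvector equals the largest eigenvalue of R.

```python
import math

# Hypothetical 2-D data set, chosen only for illustration.
raw = [(2.0, 1.9), (1.0, 1.2), (0.0, -0.1), (-1.0, -0.8), (-2.0, -2.2)]

# Remove the mean so that the zero-mean assumption holds.
n = len(raw)
mean = [sum(p[k] for p in raw) / n for k in range(2)]
data = [(p[0] - mean[0], p[1] - mean[1]) for p in raw]

# Correlation matrix R = (1/n) * sum_i x_i x_i^T (2 x 2, symmetric).
R = [[sum(x[a] * x[b] for x in data) / n for b in range(2)] for a in range(2)]


def matvec(M, v):
    """Multiply a 2 x 2 matrix by a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]


def normalize(v):
    """Scale a 2-vector to unit length."""
    length = math.hypot(v[0], v[1])
    return [v[0] / length, v[1] / length]


# Power iteration converges to the eigenvector of the largest eigenvalue.
u = normalize([1.0, 0.0])
for _ in range(200):
    u = normalize(matvec(R, u))

# Projection variance (1/n) * sum_i (x_i^T u)^2, which equals u^T R u.
proj_var = sum((x[0] * u[0] + x[1] * u[1]) ** 2 for x in data) / n

# Largest eigenvalue of a symmetric 2 x 2 matrix in closed form.
tr = R[0][0] + R[1][1]
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
lam_max = tr / 2 + math.sqrt(tr * tr / 4 - det)

print("u =", u)
print("projection variance =", proj_var, " largest eigenvalue =", lam_max)
```

For higher-dimensional data one would normally call a library eigensolver instead of power iteration; the 2 x 2 closed form is used here only to keep the check self-contained.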

Note that a correlation matrix is symmetric and its eigenvectors are orthogonal
to each other. A given vector $\mathbf{x}$ can be expressed using the $n$ eigenvectors of $\mathbf{R}$:

\[ \mathbf{x} = \sum_{i=1}^{n} p_i \mathbf{u}_i, \]

where $p_i$ $(= \mathbf{x} \cdot \mathbf{u}_i)$ is the projection of $\mathbf{x}$ onto $\mathbf{u}_i$, and $\mathbf{u}_i$ is the $i$th unit eigenvector
of $\mathbf{R}$. Index $i$ is ordered in such a manner that $\mathbf{u}_i$ belongs to the $i$th eigenvalue $\lambda_i$,
$i = 1, \ldots, n$, satisfying the following constraint:

\[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n. \]

Figure 11.10. Principle of orthogonality in principal component analysis, in which
$\mathbf{x}$ is equal to $\hat{\mathbf{x}} + \mathbf{e}$ and the truncated version $\hat{\mathbf{x}}$ is always orthogonal to the
approximation error $\mathbf{e}$.

If we want to do dimensionality reduction by retaining $m$ inputs with larger
variances, $\mathbf{x}$ can be approximated by $\hat{\mathbf{x}}$ by deleting the $n - m$ terms containing $\mathbf{u}_{m+1}, \ldots, \mathbf{u}_n$:

\[ \hat{\mathbf{x}} = \sum_{i=1}^{m} p_i \mathbf{u}_i. \]

The approximation error is given by

\[ \mathbf{e} = \mathbf{x} - \hat{\mathbf{x}} = \sum_{i=m+1}^{n} p_i \mathbf{u}_i. \]

Since the eigenvectors $\mathbf{u}_i$ are orthogonal to each other, the error vector $\mathbf{e}$ is orthogonal to the
approximating data vector $\hat{\mathbf{x}}$, regardless of the value of $m$. This is called the principle of
orthogonality, as graphically represented in Figure 11.10.

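If $\mathbf{x}$ is taken to be a random variable and $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are its instances, then the
variance of $\mathbf{x}$ is the sum of the variances of $\hat{\mathbf{x}}$ and $\mathbf{e}$. More specifically, the variance
of $\mathbf{x}$ is expressed as

\[ \sigma_{\mathbf{x}}^2 = \sigma_{\hat{\mathbf{x}}}^2 + \sigma_{\mathbf{e}}^2
   = (\lambda_1 + \cdots + \lambda_m) + (\lambda_{m+1} + \cdots + \lambda_n), \tag{11.12} \]

where $\sigma_{\hat{\mathbf{x}}}^2$ and $\sigma_{\mathbf{e}}^2$ are the variances of $\hat{\mathbf{x}}$ and $\mathbf{e}$, respectively. The preceding equation
can be proved using the orthogonality among the $\mathbf{u}_i$; this is left as an exercise.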
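
As a hedged numerical companion to the principle of orthogonality and Equation (11.12), the sketch below truncates a made-up 2-D data set to its first principal component (m = 1). The data values and variable names are illustrative assumptions. It checks that each truncated vector is orthogonal to its error vector and that the two variances equal the two eigenvalues.

```python
import math

# Hypothetical zero-mean 2-D data set (illustrative values only).
raw = [(2.0, 1.9), (1.0, 1.2), (0.0, -0.1), (-1.0, -0.8), (-2.0, -2.2)]
n = len(raw)
mean = [sum(p[k] for p in raw) / n for k in range(2)]
data = [(p[0] - mean[0], p[1] - mean[1]) for p in raw]

# Entries of the symmetric correlation matrix R = [[a, b], [b, c]].
a = sum(x[0] * x[0] for x in data) / n
b = sum(x[0] * x[1] for x in data) / n
c = sum(x[1] * x[1] for x in data) / n

# Closed-form eigenvalues of a symmetric 2 x 2 matrix, lam1 >= lam2.
root = math.sqrt(((a - c) / 2) ** 2 + b * b)
lam1 = (a + c) / 2 + root
lam2 = (a + c) / 2 - root

# Unit eigenvectors: (b, lam1 - a) solves (R - lam1 I) v = 0 when b != 0,
# which holds for this data set; u2 is the perpendicular direction.
length = math.hypot(b, lam1 - a)
u1 = (b / length, (lam1 - a) / length)
u2 = (-u1[1], u1[0])

x_hat, err = [], []
for x in data:
    p1 = x[0] * u1[0] + x[1] * u1[1]      # projection onto u1 (kept)
    p2 = x[0] * u2[0] + x[1] * u2[1]      # projection onto u2 (deleted)
    x_hat.append((p1 * u1[0], p1 * u1[1]))
    err.append((p2 * u2[0], p2 * u2[1]))

# Orthogonality: every truncated vector is orthogonal to its error vector.
dots = [xh[0] * e[0] + xh[1] * e[1] for xh, e in zip(x_hat, err)]

# Variances of the truncated data and of the error.
var_hat = sum(xh[0] ** 2 + xh[1] ** 2 for xh in x_hat) / n
var_err = sum(e[0] ** 2 + e[1] ** 2 for e in err) / n

print("max |x_hat . e| =", max(abs(d) for d in dots))
print("var(x_hat) =", var_hat, " lambda_1 =", lam1)
print("var(e)     =", var_err, " lambda_2 =", lam2)
```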

Therefore, to achieve dimensionality reduction on a data set, the correlation
matrix and its eigenvalues and eigenvectors must be found first. Next, the data set
is projected onto the subspace spanned by the eigenvectors belonging to the largest
eigenvalues.

Figure 11.11. Comparison between principal component analysis and regression
analysis. The direction of the first principal component minimizes $\sum_i L_i^2$, where $L_i$ is the
distance in a direction perpendicular to the estimated line. Meanwhile, the regression
line minimizes $\sum_i l_i^2$, where $l_i$ is the vertical distance from a data point to that line.


PCA can also be interpreted in terms of data fitting that minimizes the lengths
of error vectors perpendicular to the estimated line or surface. As indicated in
Figure 11.11, $P_i$ is the location of the $i$th data point $\mathbf{x}_i$ and $M$ is the location of
the sample mean of all the data points. The thick solid line represents the direction
of the first principal component passing through the data mean point $M$, and $H_i$
denotes the projected point of $P_i$ onto the thick line. The principle of orthogonality
(see Figure 11.10) implies that the vector from $M$ to $H_i$ is orthogonal to the vector
from $P_i$ to $H_i$. This leads to the following identity:

\[ \overline{MP_i}^2 = \overline{MH_i}^2 + \overline{P_iH_i}^2. \]

Summing up the preceding identity over all data points yields

\[ \sum_i \overline{MP_i}^2 = \sum_i \overline{MH_i}^2 + \sum_i \overline{P_iH_i}^2. \tag{11.13} \]

In the above equation, the left-hand side is a constant; the first term of the right-hand
side is proportional to the variance of the principal components; the second term
is $\sum_i L_i^2$ with $L_i = \overline{P_iH_i}$, the sum of the squares of the distances from the points
to the line, measured in a direction perpendicular to the estimated line. Therefore,
maximizing the variance of the principal components (first term) leads to minimizing
$\sum_i L_i^2$ (second term). Minimizing $\sum_i L_i^2$ is the task of the
total least-squares (TLS) method [15] for data fitting; the first principal component
direction [5, 29, 47] can be used to achieve the same goal. In comparison, the
standard least-squares (LS) method, as described in Chapter 5, attempts to find a
least-squares regression line that minimizes the vertical distances ($\sum_i l_i^2$) from the
data points to the line, as indicated in Figure 11.11.

11.6.2 Oja’s Modified Hebbian Rule 

By using the simple neural network in Figure 11.9, Oja demonstrated that with a
Hebbian-type learning rule, the network performs PCA [42].

Employing the plain Hebbian learning rule in Equation (11.10) leads to unlimited
growth of the weight vector. One solution is to renormalize the weight vector after
each update:

\[ w_i(t+1) = \frac{w_i(t) + \eta\, y(t)\, x_i(t)}
                   {\sqrt{\sum_{j=1}^{n} \left[ w_j(t) + \eta\, y(t)\, x_j(t) \right]^2}}. \tag{11.14} \]

If the learning rate $\eta$ is small, the preceding equation can be expanded about $\eta = 0$
using a Taylor series expansion. Deleting the second- and higher-order terms yields

\[ \Delta w_i = \eta y x_i - \eta y^2 w_i = \eta y (x_i - y w_i). \tag{11.15} \]

The preceding equation is the modified Hebbian learning rule, which entails 
adding a weight decay proportional to the squared output (y 2 ) to maintain the 
weight vector unit length automatically. This Hebbian-type adaptation involves less 
computation since the normalization operation in Equation (11.14) is not required. 
Hertz et al. pointed out that Equation (11.15) resembles reverse LMS learning 
(Table 11.4 in Section 11.8), in which the weight updating is based on the difference 
between the actual input and the backpropagated output [23]. 

Oja also indicated that the weight vector w approaches an eigenvector of the 
correlation matrix R with the largest eigenvalue. The larger the eigenvalue, the 
more precise the direction of the corresponding principal component (or eigenvector).
Oja later extended the formulation for multiple-output systems to perform
PCA [43].
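
To make Equation (11.15) concrete, here is a minimal sketch of Oja's rule on a single linear neuron. The data set, the learning rate of 0.01, the starting weights, and the number of epochs are all illustrative assumptions, not values from the text. The check at the end compares the Rayleigh quotient of the learned weight vector with the largest eigenvalue of R, and confirms that the weight vector stays near unit length without explicit normalization.

```python
import math

# Hypothetical zero-mean 2-D data set (illustrative values only).
data = [(2.0, 1.9), (1.0, 1.2), (0.0, -0.1), (-1.0, -0.8), (-2.0, -2.2)]
n = len(data)

eta = 0.01          # assumed small learning rate
w = [1.0, 0.0]      # assumed initial weight vector

for epoch in range(2000):
    for x in data:
        y = w[0] * x[0] + w[1] * x[1]                  # linear neuron output
        # Oja's rule, Equation (11.15): delta w_i = eta * y * (x_i - y * w_i)
        w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]

# Correlation matrix R = [[a, b], [b, c]] and its largest eigenvalue.
a = sum(x[0] * x[0] for x in data) / n
b = sum(x[0] * x[1] for x in data) / n
c = sum(x[1] * x[1] for x in data) / n
lam_max = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)

# Rayleigh quotient w^T R w / w^T w equals lam_max when w is the top eigenvector.
norm_sq = w[0] ** 2 + w[1] ** 2
rayleigh = (a * w[0] ** 2 + 2 * b * w[0] * w[1] + c * w[1] ** 2) / norm_sq

print("w =", w, " |w| =", math.sqrt(norm_sq))
print("Rayleigh quotient =", rayleigh, " largest eigenvalue =", lam_max)
```

With a constant learning rate the weights keep fluctuating slightly around the eigenvector, so the comparison is only approximate; a decaying learning rate is a common refinement.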


11.7 THE HOPFIELD NETWORK 

In 1982, Hopfield proposed the so-called Hopfield network, which possesses
auto-associative properties. It is a recurrent (or fully interconnected) network in which
all neurons are connected to each other, with the exception that no neuron has
any connection to itself. In the network configuration, he embodied a physical
principle and set up an energy function. The concept derives from a physical
system [25]:

    Any physical system whose dynamics in phase space is dominated
    by a substantial number of locally stable states to which it is attracted
    can therefore be regarded as a general content-addressable memory. The
    physical system will be a potentially useful memory if, in addition, any
    prescribed set of states can readily be made the stable states of the system.

Figure 11.12. Images of 1-D and 2-D energy terrains configured by three attractors
in a Hopfield network. The dots denote stable states where patterns are memorized.
The network state moves in the direction of the arrows to one of the attractors,
which is determined by the starting point (i.e., the given input pattern).

The Hopfield net serves both as a content-addressable memory and as a tool for solving
optimization problems. These features are discussed next.

11.7.1 Content-Addressable Nature 

The Hopfield net normally develops a number of locally stable points in state
space. Because the network's dynamics minimize energy, other points in state space
drain into the stable points (called attractors or wells), which are (possibly local)
energy minima.

The Hopfield net realizes the operation of a content-addressable (auto-associative)
memory in the sense that newly presented input patterns (or arbitrary initial states)
can be connected to the appropriate patterns stored in memory (i.e., attractors or
stable states). The presented input pattern vector cannot escape from the region,
called a basin of attraction, configured by an attractor (Figure 11.12).
Restated, the network produces a desired memorized pattern in response to the
given pattern. The initial network state is assumed to lie within the reach of the
fixed attractors' basins of attraction. That is, the entire configuration space is
divided into different basins of attraction. Notably, an attractor is regarded as a fixed
point if it is a single point in state space. In general, however, an attractor may be
chaotic or a limit cycle consisting of a periodic sequence of states. An advantage of neural
networks (NNs) lies in their fault tolerance; more specifically, NNs are
tolerant of slight distortions in a presented pattern. This NN feature is appropriate for
performing the primary task of content-addressable memory.

In terms of storage capacity, the number of memories is estimated to be nearly
10% to 20% of the number of neurons in the Hopfield net [12]. Although the Hopfield
memory net is not very efficient, its mechanism based on the energy concept is worth
exploring. The basic binary Hopfield net is first described.

11.7.2 Binary Hopfield Networks 

Formulation 

Each interconnection has a weight (or connection strength), denoted by $T_{ij}$, from
neuron $j$ to neuron $i$. The Hopfield network considers bidirectionality in the
connections, using the symmetric weight matrix, $T_{ij} = T_{ji}$, and also assumes that no
neuron is connected to itself ($T_{ii} = 0$). In Hopfield's early analysis, each neuron has
a binary state of either 0 or 1; those neurons subsequently form the binary Hopfield
net. At each moment, the entire state of the network can be represented by a binary
state vector. The neurons are assumed to be threshold logic units so that a state
has one of the two possible values. The following is the firing rule of an arbitrary
neuron $i$:

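\[ V_i = \begin{cases}
     1 & \text{if } \sum_{j \neq i} T_{ij} V_j > U_i, \\
     0 & \text{if } \sum_{j \neq i} T_{ij} V_j < U_i,
   \end{cases} \]

where $V_i$ denotes the output of neuron $i$ and $U_i$ its threshold. This can be rewritten
with a neuron function, $f(\cdot)$:

\[ V_i = f\Bigl( \sum_{j \neq i} T_{ij} V_j - U_i \Bigr), \tag{11.16} \]

where

\[ f(x) = \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x < 0. \end{cases} \]

Each network state has an associated energy (see footnote 2) in a quadratic form:

\[ E = -\frac{1}{2} \sum_i \sum_{j \neq i} T_{ij} V_i V_j + \sum_i V_i U_i. \tag{11.17} \]

The change in energy ($\Delta E$) with respect to the change of state at neuron $i$ ($\Delta V_i$)
is derived from Equation (11.17):

\[ \Delta E = -\Delta V_i \Bigl( \sum_{j \neq i} T_{ij} V_j - U_i \Bigr). \tag{11.18} \]

When a neuron $i$ alters its state according to the firing rule defined in
Equation (11.16), $\Delta V_i$ has the same sign as the term in parentheses, so $\Delta E$ is
negative whenever the state actually changes and zero otherwise. In other words, $E$
never increases as the network state evolves.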

Footnote 2: The energy in Equation (11.17) has a quadratic form similar to the kinetic energy $E_k$ of a
mass $m$ at velocity $v$:

\[ E_k = \frac{1}{2} m v^2. \]

In many cases, physical energy can be represented in such a quadratic form.

Table 11.1. Transitional behavior per state in the two-neuron Hopfield net. Act_i
denotes the activation of neuron i.

Current  Energy  Current   Chosen                   Next      Next
state    level   V1  V2    neuron   Act1    Act2    V1  V2    state
--------------------------------------------------------------------
A         0      0   0     1        -0.7    -       0   0     A
                           2        -        0.1    0   1     B
B        -0.1    0   1     1        -0.4    -       0   1     B
                           2        -        0.1    0   1     B
C         0.7    1   0     1        -0.7    -       0   0     A
                           2        -        0.4    1   1     D
D         0.3    1   1     1        -0.4    -       0   1     B
                           2        -        0.4    1   1     D

Skiing Down the Energy Slope toward a Stable Well 

This subsection presents the operational picture of the binary Hopfield net to
explain “state transitions” more thoroughly with a fixed set of weights. The
network can be seen as a dynamical system moving through a sequence of states
toward a stable state over time. For computational simplicity, we describe a small
two-neuron binary Hopfield net illustrated in Figure 11.13(a). We consider a
particular weight parameter setup as follows:

\[ T_{12} = T_{21} = 0.3, \quad U_1 = 0.7, \quad U_2 = -0.1. \]

Recall that each neuron can have one of two states. Hence, this two-neuron binary
network can have a total of four states: state A, state B, state C, and state D,
corresponding to the four pairs of $(V_1, V_2)$: (0, 0), (0, 1), (1, 0), and (1, 1), respectively.
According to Equation (11.17),

\[ E = -T_{12} V_1 V_2 + V_1 U_1 + V_2 U_2 = -0.3 V_1 V_2 + 0.7 V_1 - 0.1 V_2. \]

Using the preceding equation, we can calculate the energy levels of the four states:
$E = 0$ for state A, $E = -0.1$ for state B, $E = 0.7$ for state C, and $E = 0.3$ for
state D.

In light of Equation (11.16), we consider two activations:

\[ \mathrm{Act}_1 = T_{12} V_2 - U_1 = 0.3 V_2 - 0.7, \]

\[ \mathrm{Act}_2 = T_{21} V_1 - U_2 = 0.3 V_1 + 0.1. \]

For state A ($V_1 = V_2 = 0$), $\mathrm{Act}_1$ at neuron 1 is negative (−0.7), and $\mathrm{Act}_2$ at
neuron 2 is positive (0.1). Therefore, if neuron 2 is chosen to update, state A transits
to state B ($V_1 = 0$, $V_2 = 1$). If neuron 1 is chosen, however, state A
does not transit. Table 11.1 summarizes all state transitions and energy levels.

Figure 11.13. The simplest two-neuron Hopfield net that has one stable state:
(a) network topology, and (b) four possible states with their corresponding energy
levels and state transitions with their transitional probabilities.
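
The entries of Table 11.1 can be reproduced with a few lines of code. The sketch below is illustrative only; the function and variable names are assumptions. It applies the firing rule of Equation (11.16) and the energy of Equation (11.17) to the two-neuron net with $T_{12} = T_{21} = 0.3$, $U_1 = 0.7$, and $U_2 = -0.1$.

```python
T12 = 0.3                      # symmetric weight, T12 = T21
U = (0.7, -0.1)                # thresholds U1 and U2
STATES = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
NAMES = {v: k for k, v in STATES.items()}


def energy(v):
    """Energy of state v = (V1, V2), from Equation (11.17)."""
    return -T12 * v[0] * v[1] + v[0] * U[0] + v[1] * U[1]


def activation(v, i):
    """Net input minus threshold for neuron i (0-based index)."""
    return T12 * v[1 - i] - U[i]


def next_state(v, i):
    """Apply the firing rule of Equation (11.16) to neuron i only."""
    act = activation(v, i)
    new = list(v)
    if act > 0:
        new[i] = 1
    elif act < 0:
        new[i] = 0
    return tuple(new)


energies = {name: energy(v) for name, v in STATES.items()}
table = {}
for name, v in STATES.items():
    for i in (0, 1):
        nxt = NAMES[next_state(v, i)]
        table[(name, i + 1)] = nxt
        print(f"state {name} (E = {energies[name]:+.1f}), neuron {i + 1}: "
              f"Act = {activation(v, i):+.1f} -> state {nxt}")
```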


To delineate the complete state transitions, we need to mention asynchronous 
updating, which Hopfield originally employed in 1982. In this operation, a single 
neuron is randomly selected to modify its output for a given time, occasionally 
modifying it and occasionally leaving it the same. Any one of the neurons has 
a roughly equal probability of firing at each moment. The updating of neurons 
continues until no more changes can be made. Many updates must be applied to 
all neurons before the network eventually settles in a stable configuration. 
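
A minimal simulation of asynchronous updating on the same two-neuron net is sketched below. The random seed, the stopping test, and the helper names are assumptions made for this illustration. Starting from each of the four states, it picks one neuron at random per step and records the energy after every step.

```python
import random

T12 = 0.3                      # symmetric weight, T12 = T21
U = (0.7, -0.1)                # thresholds U1 and U2


def energy(v):
    """Energy of state v = (V1, V2), from Equation (11.17)."""
    return -T12 * v[0] * v[1] + v[0] * U[0] + v[1] * U[1]


def update(v, i):
    """Apply the firing rule of Equation (11.16) to neuron i (0-based)."""
    act = T12 * v[1 - i] - U[i]
    new = list(v)
    if act > 0:
        new[i] = 1
    elif act < 0:
        new[i] = 0
    return tuple(new)


def is_stable(v):
    """A state is stable when updating either neuron leaves it unchanged."""
    return all(update(v, i) == v for i in range(2))


def run(start, rng):
    """Asynchronous updating: one randomly chosen neuron per step."""
    state, trace = start, [energy(start)]
    while not is_stable(state):
        state = update(state, rng.randrange(2))
        trace.append(energy(state))
    return state, trace


rng = random.Random(0)         # fixed seed, only for reproducibility
results = {s: run(s, rng) for s in [(0, 0), (0, 1), (1, 0), (1, 1)]}
for start, (final, trace) in results.items():
    print(start, "->", final, "energies:", [round(e, 2) for e in trace])
```

Every run ends in state B = (0, 1), and the recorded energy never increases along the way, which is the behavior the energy argument predicts for this weight setup.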

Figure 11.13(b) displays the state transition diagram of the two-neuron binary
Hopfield net with the previously specified weight setup. (More detailed discussion
of the algorithms and mathematical formulas for calculating weights to store
patterns can be found in refs. [2, 18, 19, 23, 49].) Whatever its starting state is, the
network eventually settles in state B rather than cycling from state to state. Of
course, more complicated networks with more than two neurons may develop
many stable states (attractors or wells) to memorize several patterns. In such
networks, spurious (false) states must be dealt with [19, 23]. In keeping with the concept of physical
energy, the successive firing can be interpreted as sliding down the energy surface.
The energy decreases until it reaches a (possibly local) minimum. Aleksander
and Morton discussed more “memory” examples in ref. [2].

Table 11.2. The comparison between asynchronous updating and continuous updating.

Network type:
    Binary net            Continuous-valued net
Updating:
    Asynchronous          Continuous
Neuron function f(x):
    Binary net:           Threshold logic unit [e.g., sgn(x)]
    Continuous net:       Sigmoid [e.g., (1/2)(1 + tanh(x/gain))]
Description:
    Binary net:           Update only one randomly selected neuron's output,
                          according to Equation (11.16).
    Continuous net:       Update continuously and simultaneously all neurons'
                          outputs toward the values given by Equation (11.19), as
                          well as the net input net_i; see Equations (11.20) and (11.21).


11.7.3 Continuous-Valued Hopfield Networks


Hopfield extended the binary net to the continuous-valued net in 1984 [26]. The
neuron generates a continuous range of outputs. In the binary net, a neuron function
$f(\cdot)$ is typically stipulated by a hard threshold function [Equation (11.16)]. In the
continuous-valued Hopfield net, on the other hand, a sigmoidal function is typically
employed as a neuron function $f(\cdot)$. Updates in time are described continuously
(continuous updating). For instance, the neurons are governed by the following
firing rule:

\[ V_i = f(net_i) = \frac{1}{2} \left( 1 + \tanh\frac{net_i}{gain} \right), \qquad
   net_i = \sum_{j \neq i} T_{ij} V_j - U_i, \tag{11.19} \]

where the parameter $gain$ determines the slope of the sigmoid.

By using a neuron function $f(\cdot)$, the network state is governed by the differential
equation:

\[ K_i \frac{dV_i}{dt} = -V_i + f\Bigl( \sum_j T_{ij} V_j - U_i \Bigr), \tag{11.20} \]

where $K_i$ is a positive constant. The term $dV_i/dt$ represents a directional vector
toward an attractor, where $dV_i/dt = 0$. It can be denoted by an arrow, as in
Figure 11.12. At an attractor, the net input $net_i$ to neuron $i$ should be equal to
$\sum_j T_{ij} V_j - U_i$ to have an ideal output: $V_i$ $(= f(\sum_j T_{ij} V_j - U_i))$. Hence, a dynamical
equation similar to Equation (11.20) can be derived with respect to $net_i$:

\[ k_i \frac{d\,net_i}{dt} = -net_i + \sum_j T_{ij} V_j - U_i
   = -net_i + \sum_j T_{ij} f(net_j) - U_i, \tag{11.21} \]

where $k_i$ is a positive constant. We have discussed so far two modes of updating,
the asynchronous scheme and the continuous one, which are summarized in Table 11.2.
Another possible scheme is synchronous updating, wherein at each instant of
time, all neurons' outputs are simultaneously set to the values given by $f(\cdot)$ [4, 40].
The network may go to the correct memorized pattern after the first updating.

The foregoing dynamics controlled by Equations (11.20) and (11.21) can be
found in hardware systems. We show an electrical Hopfield model in the next
subsection. Moreover, we describe the application of the continuous net to the
traveling salesperson problem.
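
The continuous dynamics can be explored numerically. The sketch below integrates a differential equation of the form $K_i\, dV_i/dt = -V_i + f(\sum_j T_{ij} V_j - U_i)$ with the forward Euler method, reusing the two-neuron weights of the binary example ($T_{12} = T_{21} = 0.3$, $U_1 = 0.7$, $U_2 = -0.1$). The sigmoid form $f(net) = \frac{1}{2}(1 + \tanh(net/gain))$, the value gain = 0.05, the time step, $K_i = 1$, and the starting point are all assumptions chosen for illustration.

```python
import math

T = [[0.0, 0.3], [0.3, 0.0]]   # symmetric weights, no self-connection
U = [0.7, -0.1]                # thresholds
gain = 0.05                    # assumed sigmoid parameter (small = steep)
K = 1.0                        # assumed positive time constant
dt = 0.01                      # assumed Euler step size


def f(net):
    """Assumed sigmoidal neuron function with outputs in (0, 1)."""
    return 0.5 * (1.0 + math.tanh(net / gain))


def derivative(V):
    """Right-hand side dV_i/dt = (-V_i + f(sum_j T_ij V_j - U_i)) / K."""
    dV = []
    for i in range(2):
        net = sum(T[i][j] * V[j] for j in range(2)) - U[i]
        dV.append((-V[i] + f(net)) / K)
    return dV


V = [1.0, 0.0]                 # start near binary state C = (1, 0)
for _ in range(3000):          # integrate up to t = 30
    dV = derivative(V)
    V = [V[i] + dt * dV[i] for i in range(2)]

residual = max(abs(d) for d in derivative(V))
print("final outputs:", V, " max |dV/dt| =", residual)
```

With a steep sigmoid the trajectory settles close to the corner (0, 1), the continuous counterpart of the binary net's stable state B; a larger gain value in this parameterization gives a softer sigmoid and an equilibrium further from the corner.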


Electrical Implementation 

This subsection discusses an analog electrical circuit to implement the
continuous-valued Hopfield net [18, 23, 26, 49]. Figure 11.14 illustrates a circuit diagram that
is composed of resistors, capacitors, and nonlinear amplifiers. The $i$th amplifier
produces an output voltage $V_i$ given by $f(u_i)$, where $u_i$ is the input voltage and
$f$ is a differentiable activation function such as the one defined in Equation (11.19).
The conductances $1/R_{ij}$ function as the connection weights $T_{ij}$. To allow weight
values to be negative, resistors $R_{ij}$ are connected to $-V_j$. The sign of a weight
is determined by selecting the positive or negative output of amplifier $j$. Such
selection is realized by an additional inverting amplifier and a negative signal wire,
which are omitted for simplicity's sake in Figure 11.14, where $I_i$ is any other external
input current (or bias) to amplifier $i$.

Figure 11.15 illustrates the $i$th electrical neuron in the Hopfield circuit. From
this figure, the circuit equation can be obtained by employing Ohm's law and
Kirchhoff's current law. By considering the total current entering capacitor $C_i$, we have

\[ C_i \frac{du_i}{dt} = \sum_j \frac{1}{R_{ij}} (V_j - u_i) + \left( -\frac{u_i}{R_i} \right) + I_i
   = \sum_j \frac{V_j}{R_{ij}} - \Bigl( \sum_j \frac{1}{R_{ij}} + \frac{1}{R_i} \Bigr) u_i + I_i. \]

This can be rewritten as

\[ C_i \frac{du_i}{dt} = -G_i u_i + \sum_j T_{ij} V_j + I_i = -G_i u_i + \sum_j T_{ij} f(u_j) + I_i, \tag{11.22} \]

where

\[ G_i = \sum_j \frac{1}{R_{ij}} + \frac{1}{R_i}, \qquad T_{ij} = \frac{1}{R_{ij}}. \]

Figure 11.14. Diagram of an electrical circuit to implement a continuous Hopfield
network.

Figure 11.15. The $i$th electrical neuron in the Hopfield circuit in Figure 11.14.

Dynamical Equation (11.22) of the electrical system has the same form as
Equation (11.21), with $u_i$ playing the role of $net_i$ in Equation (11.21). (The functional
equivalence in terms of energy is well treated in refs. [18, 23, 26, 49].) Thus, the circuit
depicted in Figure 11.14 functions as a Hopfield net.

Table 11.3. The network state matrix to represent a specific tour of five cities,
C3 → C1 → C5 → C2 → C4, for our five-city traveling salesperson problem.

City         Position in tour
name         1     2     3     4     5
---------------------------------------
C1           0     1     0     0     0
C2           0     0     0     1     0
C3           1     0     0     0     0
C4           0     0     0     0     1
C5           0     0     1     0     0
---------------------------------------
Tour         C3    C1    C5    C2    C4


Zurada describes a variety of electrical circuit models in ref. [49]. Some
researchers implemented other Hopfield hardware models (e.g., Farhat et al.
discussed optical implementation of the binary Hopfield net [13]).


11.7.4 Traveling Salesperson Problem 

The Hopfield net can be applied to solving combinatorial optimization problems. 
This subsection discusses the well-known traveling salesperson problem (also discussed
in Chapter 7). Hopfield obtained a good solution that is close to the optimal
solution [27]. Although practical limitations for this problem are known, the design
procedure of the Hopfield net approach is informative and instructive. We first
discuss how to design the Hopfield network for the given problem, and then briefly
describe how to design the energy function.

For simplicity, we consider a small map consisting of only five cities. A sales- 
person would like to start at a certain city, visit each of the other four cities only 
once, and then return to the first city; such a path is referred to here as a tour. In 
what order should the salesperson visit the five cities to minimize the total traveling 
distance? We discuss how to apply the Hopfield net to this five-city problem. 


Representation 

We first represent this problem in the network framework; each neuron corresponds
to a city paired with a position in the visiting order. Since each neuron's output ranges
from 0.0 to 1.0, the two extreme values correspond to whether or not the city is visited
at that position. If a city is to be visited fourth, for example, the city can be
represented by the binary vector (0, 0, 0, 1, 0). The excitatory neuron $ij$ is interpreted
here as “city $C_i$ is to be visited $j$th.” For this five-city map, we can construct a
Hopfield net that consists of 25 (= 5 × 5) neurons. This means that a tour can be
represented by a (5 × 5) state matrix $\mathbf{V}_s$ of the neurons' outputs $v_{ij}$:

\[ \mathbf{V}_s = \begin{bmatrix}
   v_{11} & v_{12} & v_{13} & v_{14} & v_{15} \\
   v_{21} & v_{22} & v_{23} & v_{24} & v_{25} \\
   v_{31} & v_{32} & v_{33} & v_{34} & v_{35} \\
   v_{41} & v_{42} & v_{43} & v_{44} & v_{45} \\
   v_{51} & v_{52} & v_{53} & v_{54} & v_{55}
   \end{bmatrix}. \tag{11.23} \]


The columns of the state matrix $\mathbf{V}_s$ signify different positions in a tour, and the
rows identify the cities. By convention, the first index on an element $v_{ij}$ (output
of neuron $ij$) denotes its row, the second index its column. Assume that the five
cities should be visited in the following order:

\[ C_3 \Rightarrow C_1 \Rightarrow C_5 \Rightarrow C_2 \Rightarrow C_4. \]

For this specific tour, we expect to have the following state matrix as a result (as
presented in Table 11.3):

\[ \mathbf{V}_s = \begin{bmatrix}
   0 & 1 & 0 & 0 & 0 \\
   0 & 0 & 0 & 1 & 0 \\
   1 & 0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 & 1 \\
   0 & 0 & 1 & 0 & 0
   \end{bmatrix}. \tag{11.24} \]


Energy Function 

In constructing the energy function, we need to express clearly all constraints and 
the goal of the given problem. 


• Constraint 1: The salesperson cannot visit more than one city at the same
  time.

• Constraint 2: The salesperson visits each of the cities only once.

• Goal: The salesperson would like to minimize the total traveling distance.

Constraint 1 indicates that each column of matrix $\mathbf{V}_s$ has all zeros except for
one element. For a fixed column $j$, we have

\[ e_{1j} = (v_{1j} + v_{2j} + v_{3j} + v_{4j} + v_{5j} - 1)^2
          = \Bigl( \sum_{i=1}^{5} v_{ij} - 1 \Bigr)^2. \tag{11.25} \]

For all columns, we have

\[ E_1 = \sum_{j=1}^{5} e_{1j} = \sum_{j=1}^{5} \Bigl( \sum_{i=1}^{5} v_{ij} - 1 \Bigr)^2. \tag{11.26} \]





Constraint 2 indicates that each row of matrix $\mathbf{V}_s$ has all zeros except for one
element. Considering a fixed row $i$, we have

\[ e_{2i} = (v_{i1} + v_{i2} + v_{i3} + v_{i4} + v_{i5} - 1)^2. \tag{11.27} \]

For all rows, we have

\[ E_2 = \sum_{i=1}^{5} e_{2i} = \sum_{i=1}^{5} \Bigl( \sum_{j=1}^{5} v_{ij} - 1 \Bigr)^2. \tag{11.28} \]

To deal with the goal, we let $L(a, b)$ be the distance between city $a$ and city $b$.
For instance, $L(1, 2)$ denotes the distance between two cities: $C_1$ and $C_2$. If $C_1$
is visited $m$th, then for the leg between $C_1$ and $C_2$ to be part of the tour, $C_2$ should
be visited either $(m-1)$th or $(m+1)$th. Because $v_{ij} = 0$ or 1, the contribution of this
pair of cities to the tour length can be written as

\[ L(1, 2)\, v_{1,m}\, v_{2,m-1} + L(1, 2)\, v_{1,m}\, v_{2,m+1}. \]

In the state presented in Table 11.3, this contribution is 0, because the two cities $C_1$
and $C_2$ are not visited consecutively. By generalizing the preceding expression, the
distance sum to be minimized is given by

\[ E_3 = \frac{1}{2} \sum_k \sum_i \sum_j L(i, j)\, v_{ik}\, (v_{j,k-1} + v_{j,k+1}), \quad i, j = 1, 2, 3, 4, 5. \tag{11.29} \]

We then obtain the total energy function in the following form:

\[ E = k_1 E_1 + k_2 E_2 + k_3 E_3, \tag{11.30} \]

where $k_i$ are constant values. Weights ($T_{ij}$) and thresholds ($U_i$) can be determined
by equating terms between Equations (11.17) and (11.30). Further details about
calculating weights can be found in refs. [11, 27, 49]. Ideally, the states with the
shortest tour distances have the lowest energy values.
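
The three energy terms can be evaluated directly for the state matrix of Equation (11.24). The sketch below is illustrative: the city coordinates are made up, and the position indices $k-1$ and $k+1$ are taken modulo 5 so that the tour closes on itself (an assumption consistent with the salesperson returning to the first city). It shows that both constraint terms vanish for a valid tour and that the raw triple sum over $L(i,j)\, v_{ik} (v_{j,k-1} + v_{j,k+1})$ counts every leg of the tour twice.

```python
import math

# State matrix of Equation (11.24): rows are cities C1..C5,
# columns are positions 1..5 in the tour C3 -> C1 -> C5 -> C2 -> C4.
V = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
N = len(V)

# Hypothetical city coordinates, used only to obtain some distances L(i, j).
coords = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 5.0)]
L = [[math.dist(p, q) for q in coords] for p in coords]

# Constraint terms: one active neuron per column (E1) and per row (E2).
E1 = sum((sum(V[i][j] for i in range(N)) - 1) ** 2 for j in range(N))
E2 = sum((sum(V[i][j] for j in range(N)) - 1) ** 2 for i in range(N))

# Triple sum of the distance term, with cyclic position indices (assumption).
triple_sum = sum(
    L[i][j] * V[i][k] * (V[j][(k - 1) % N] + V[j][(k + 1) % N])
    for k in range(N) for i in range(N) for j in range(N)
)

# Direct tour length for comparison: read the visiting order off the columns.
order = [next(i for i in range(N) if V[i][k] == 1) for k in range(N)]
tour_length = sum(L[order[k]][order[(k + 1) % N]] for k in range(N))

print("E1 =", E1, " E2 =", E2)
print("visiting order (0-based cities):", order)
print("triple sum / 2 =", triple_sum / 2, " tour length =", tour_length)
```

Any constant factor on the distance term can be absorbed into the coefficient $k_3$ of the total energy, so halving the triple sum matters only if the term is to equal the tour length exactly.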

It is known, however, that this approach has practical limitations for larger 
problems [6, 21, 28, 48]. Many researchers have discussed how to improve the per- 
formance of the Hopfield net. Because the procedure of altering states is performed 
in an irreversible fashion, the stability can be purely local in the Hopfield net. 
Stochastic optimization techniques can be employed for overcoming this limitation, 
as discussed in the next subsection. Genetic algorithms are applicable to tuning 
parameters $k_i$ in Equation (11.30) [37]. For solving a job shop scheduling problem,
a Hopfield-like net with some stochastic nature has been addressed in ref. [14].

11.7.5 The Boltzmann Machine 

To overcome the limitation stemming from the irreversible state-altering manner,
simulated annealing (SA) can be incorporated; the upgraded network is known
as the Boltzmann machine, which is considered to be the Hopfield model with a
stochastic nature of learning (Chapter 7). More specifically, the firing rule of the
binary network may be governed by the following probability function:

\[ \mathrm{Prob}(V_i = 1) = \frac{1}{1 + \exp(-net_i / Temp)}, \]

\[ net_i = \sum_j T_{ij} V_j - U_i, \]

\[ \mathrm{Prob}(V_i = 0) = 1 - \mathrm{Prob}(V_i = 1), \]

where $Temp$ is a temperature parameter having an effect on the firing probability.
Hence, this updating does not necessarily direct the network state toward minimizing
energy. In other words, noise is introduced to the Hopfield learning mechanism
to “shake” the network state out of a local minimum.
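
A small sketch of the stochastic firing rule follows. The function names, the two-neuron weights borrowed from the earlier binary example, and the temperature values are assumptions for illustration. It shows that a high temperature makes firing nearly a coin flip, while a low temperature recovers the deterministic threshold behavior.

```python
import math
import random


def prob_fire(net, temp):
    """Probability that a neuron with net input `net` fires at temperature `temp`."""
    return 1.0 / (1.0 + math.exp(-net / temp))


def stochastic_update(v, i, T, U, temp, rng):
    """Resample the binary output of neuron i from the firing probability."""
    net = sum(T[i][j] * v[j] for j in range(len(v)) if j != i) - U[i]
    new = list(v)
    new[i] = 1 if rng.random() < prob_fire(net, temp) else 0
    return new


# Two-neuron weights reused from the binary example (illustrative choice).
T = [[0.0, 0.3], [0.3, 0.0]]
U = [0.7, -0.1]

for temp in (100.0, 1.0, 0.01):
    # Net inputs seen in state B = (0, 1): neuron 1 gets -0.4, neuron 2 gets +0.1.
    print(f"Temp = {temp:6.2f}: "
          f"P(V1 = 1) = {prob_fire(-0.4, temp):.4f}, "
          f"P(V2 = 1) = {prob_fire(0.1, temp):.4f}")

rng = random.Random(0)         # fixed seed, only for reproducibility
sample = stochastic_update([0, 1], 0, T, U, temp=1.0, rng=rng)
print("one stochastic update of neuron 1 from state B:", sample)
```

In a simulated-annealing schedule the temperature would be lowered gradually, so that early updates explore the state space and late updates behave like the deterministic Hopfield rule.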

In addition, the Boltzmann machine possesses hidden units, thereby permitting
weight adjustments by supervised learning of a stochastic form to minimize the
difference between the energies of given states and their desired energies [1, 24, 46].

11.8 SUMMARY 

This chapter presents neural networks with two learning modes: unsupervised and 
recording learning. Unsupervised learning is useful for analyzing data without de- 
sired outputs; the networks evolve to capture density characteristics of a data set. 
Recording learning is employed for designing associative memories; it can be used 
for combinatorial optimization problems. As a summary of these discussions, we 
tabulate a compendium of neural network learning formulas in Table 11.4, as ex- 
plained next. 

Amari indicated that most learning rules are generally formulated as follows [3]:

\[ \Delta \mathbf{w}_j = \eta\, r\, \mathbf{x}, \tag{11.31} \]

where $r$ is a learning signal function. The function $r$ depends on whether the teacher
(or target) signal $t$ is available or not:

\[ r = r(\mathbf{x}, \mathbf{w}, t) \quad \text{for supervised learning,} \]

or

\[ r = r(\mathbf{x}, \mathbf{w}) \quad \text{for unsupervised learning.} \]

Equation (11.31) corresponds to the Widrow-Hoff rule when $r = t - \mathbf{w}^T \mathbf{x}$, as
discussed previously in Chapter 9. For the Hebbian learning rule, the learning signal $r$
is simply equivalent to the neuron's output.

Recall the winner-take-all learning rule in Equation (11.2). Because this rule
does not conform to Equation (11.31), for our convenience, the learning signal
vector is defined here as

\[ \Delta \mathbf{w}_j = \eta\, (\text{learning signal vector}). \]

Table 11.4. Typical neural network learning formulas, with t the target output, y
the neural network's output, x the input vector, and w the weight vector.

Learning algorithm                  Learning signal vector     Learning mode
-----------------------------------------------------------------------------
Hebbian                             y x                        Unsupervised
Correlation                         t x                        Supervised
Winner-take-all (or competitive)    x - w                      Unsupervised
Outstar                             t - w_i                    Supervised
Perceptron                          {t - sgn(w^T x)} x         Supervised
Oja's (reversed LMS)                (x - y w) y                Unsupervised
LMS (least mean square)
  or Widrow-Hoff                    (t - w^T x) x              Supervised
Delta                               {(t - y) f'(net)} x        Supervised

[The learning signal vector corresponds to rx in Amari’s formulation; see Equa- 
tion (11.31).] By using the learning signal vector, Table 11.4 summarizes typical 
neural learning rules. 
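
Several rows of Table 11.4 can be expressed in a few lines of code. The sketch below is only illustrative: it assumes a single linear neuron with output $y = \mathbf{w}^T \mathbf{x}$, and the function name, rule labels, and numerical values are made up for the example. Each weight change is the learning rate times the learning signal vector.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def learning_signal_vector(rule, x, w, t=None):
    """Learning signal vector for a few rows of Table 11.4 (linear neuron assumed)."""
    y = dot(w, x)                              # assumed output y = w^T x
    if rule == "hebbian":
        return [y * xi for xi in x]            # y x
    if rule == "correlation":
        return [t * xi for xi in x]            # t x
    if rule == "winner-take-all":
        return [xi - wi for xi, wi in zip(x, w)]   # x - w (winning unit only)
    if rule == "lms":
        return [(t - y) * xi for xi in x]      # (t - w^T x) x
    if rule == "oja":
        return [(xi - y * wi) * y for xi, wi in zip(x, w)]   # (x - y w) y
    raise ValueError(f"unknown rule: {rule}")


def delta_w(rule, x, w, eta, t=None):
    """Weight change: eta times the learning signal vector."""
    return [eta * s for s in learning_signal_vector(rule, x, w, t)]


# Illustrative numbers only.
x, w, t, eta = [1.0, 2.0], [0.5, 0.5], 1.0, 0.1
updates = {rule: delta_w(rule, x, w, eta, t)
           for rule in ("hebbian", "correlation", "winner-take-all", "lms", "oja")}
for rule, dw in updates.items():
    print(f"{rule:16s} delta w = {dw}")
```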


EXERCISES 

1. Prove that Equation (11.4) implements an on-line gradient descent scheme for
minimizing the objective function in Equation (11.5).

2. Modify the MATLAB program kfm.m to generate Figures 11.4, 11.5, and 11.6
such that the output units are organized in a 1-D vector instead of a 2-D array.

3. Prove that the plain Hebbian learning rule in Equation (11.10) implements
an on-line gradient descent scheme for minimizing the objective function in
Equation (11.11).

4. Derive Equation (11.15) from Equation (11.14) using the Taylor series
expansion under the assumption that the learning rate $\eta$ is small.

5. Use the orthogonality among the $\mathbf{u}_i$ to prove Equation (11.12).

6. Consider the application of the Hopfield net to the well-known knapsack
problem, where the objective is to maximize the total value of $n$ objects ($o_i$)
put in a knapsack subject to a maximum weight constraint, $C$. This problem
may be summarized mathematically as follows:

\[ \max \sum_{i}^{n} v_i o_i \quad \text{subject to} \quad \sum_{i}^{n} w_i o_i \leq C, \]

where $v_i$ denotes the $i$th object's value, and $w_i$ the $i$th object's weight. To solve
this knapsack problem, set up the energy function by following the procedure
described in Section 11.7.4.

Neuro-Fuzzy Modeling




Chapter 12 


ANFIS: 

Adaptive Neuro-Fuzzy
Inference Systems 

J.-S. R. Jang 


12.1 INTRODUCTION 

The architectures and learning rules of adaptive networks have been described in 
the previous chapter. Functionally, there are almost no constraints on the node 
functions of an adaptive network except for the requirement of piecewise differen- 
tiability. Structurally, the only limitation on the network configuration is that it 
should be of the feedforward type (in some cases after unfolding) if we do not want 
to use the more complex asynchronously operated model. Because of these mini- 
mal restrictions, adaptive networks can be employed directly in a wide variety of 
applications of modeling, decision making, signal processing, and control. 

In this chapter, we propose a class of adaptive networks that are functionally 
equivalent to fuzzy inference systems. The proposed architecture is referred to as 
ANFIS, which stands for adaptive network-based fuzzy inference system or, 
semantically equivalently, adaptive neuro-fuzzy inference system. We describe 
how to decompose the parameter set to facilitate the hybrid learning rule for ANFIS 
architectures representing both the Sugeno and Tsukamoto fuzzy models. We also 
demonstrate that under certain minor constraints, the radial basis function network 
(RBFN) is functionally equivalent to the ANFIS architecture for the Sugeno fuzzy 
model. The effectiveness of ANFIS with the hybrid learning rule is tested through 
four simulation examples: Example 1 models a two-dimensional sinc function; 
Example 2 models a three-input nonlinear function that was used as a benchmark 
problem for other fuzzy modeling approaches; Example 3 explains how to identify 
nonlinear components in an on-line control system; and Example 4 predicts the 
Mackey-Glass chaotic time series. The results from ANFIS are compared 
extensively with connectionist approaches and conventional statistical methods. More 
ANFIS applications for a number of different domains are described in Chapter 19. 

Figure 12.1. (a) A two-input first-order Sugeno fuzzy model with two rules; (b) 
equivalent ANFIS architecture. Annotations in panel (a): $f_1 = p_1 x + q_1 y + r_1$, 
$f_2 = p_2 x + q_2 y + r_2$, and 
$f = \frac{w_1 f_1 + w_2 f_2}{w_1 + w_2} = \bar{w}_1 f_1 + \bar{w}_2 f_2$. 

Note that similar network structures were also proposed independently by Lin 
and Lee [21] and Wang and Mendel [40]. 

12.2 ANFIS ARCHITECTURE 

For simplicity, we assume that the fuzzy inference system under consideration has 
two inputs x and y and one output z. For a first-order Sugeno fuzzy model [34, 37, 
38], a common rule set with two fuzzy if-then rules is the following: 

Rule 1: If x is $A_1$ and y is $B_1$, then $f_1 = p_1 x + q_1 y + r_1$, 

Rule 2: If x is $A_2$ and y is $B_2$, then $f_2 = p_2 x + q_2 y + r_2$. 

Figure 12.1(a) illustrates the reasoning mechanism for this Sugeno model; the 
corresponding equivalent ANFIS architecture is shown in Figure 12.1(b), where nodes 
of the same layer have similar functions, as described next. (Here we denote the 
output of the ith node in layer l as $O_{l,i}$.) 

Layer 1 Every node i in this layer is an adaptive node with a node function 

$$\begin{aligned}
O_{1,i} &= \mu_{A_i}(x), && \text{for } i = 1, 2, \text{ or} \\
O_{1,i} &= \mu_{B_{i-2}}(y), && \text{for } i = 3, 4,
\end{aligned} \tag{12.1}$$

where x (or y) is the input to node i and $A_i$ (or $B_{i-2}$) is a linguistic label (such 
as “small” or “large”) associated with this node. In other words, $O_{1,i}$ is the 
membership grade of a fuzzy set A ($= A_1$, $A_2$, $B_1$, or $B_2$) and it specifies the 
degree to which the given input x (or y) satisfies the quantifier A. Here the 
membership function for A can be any appropriate parameterized membership 
function introduced in Section 2.4.1, such as the generalized bell function: 

$$\mu_A(x) = \frac{1}{1 + \left| \dfrac{x - c_i}{a_i} \right|^{2 b_i}}, \tag{12.2}$$

where $\{a_i, b_i, c_i\}$ is the parameter set. As the values of these parameters 
change, the bell-shaped function varies accordingly, thus exhibiting various 
forms of membership functions for fuzzy set A. Parameters in this layer are 
referred to as premise parameters. 


Layer 2 Every node in this layer is a fixed node labeled Π, whose output is the 
product of all the incoming signals: 

$$O_{2,i} = w_i = \mu_{A_i}(x)\, \mu_{B_i}(y), \quad i = 1, 2. \tag{12.3}$$

Each node output represents the firing strength of a rule. In general, any 
other T-norm operator (Section 2.5.2) that performs fuzzy AND can be used 
as the node function in this layer. 

Layer 3 Every node in this layer is a fixed node labeled N. The ith node calculates 
the ratio of the ith rule’s firing strength to the sum of all rules’ firing strengths: 

$$O_{3,i} = \bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2. \tag{12.4}$$

For convenience, outputs of this layer are called normalized firing strengths. 

Layer 4 Every node i in this layer is an adaptive node with a node function 

$$O_{4,i} = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i), \tag{12.5}$$

where $\bar{w}_i$ is a normalized firing strength from layer 3 and $\{p_i, q_i, r_i\}$ is 
the parameter set of this node. Parameters in this layer are referred to as 
consequent parameters. 

Layer 5 The single node in this layer is a fixed node labeled Σ, which computes 
the overall output as the summation of all incoming signals: 

$$\text{overall output} = O_{5,1} = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i}. \tag{12.6}$$

Figure 12.2. ANFIS architecture for the Sugeno fuzzy model, where weight 
normalization is performed at the very last layer. 


Thus we have constructed an adaptive network that is functionally equivalent 
to a Sugeno fuzzy model. Note that the structure of this adaptive network is not 
unique; we can combine layers 3 and 4 to obtain an equivalent network with only 
four layers. By the same token, we can perform the weight normalization at the last 
layer; Figure 12.2 illustrates an ANFIS of this type. In the extreme case, we can 
even shrink the whole network into a single adaptive node with the same parameter 
set. Obviously, the assignment of node functions and the network configuration are 
arbitrary, as long as each node and each layer perform meaningful and modular 
functionalities. 
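
The five layers described above can be traced numerically. The following is a minimal Python sketch of the forward pass through the two-rule network of Figure 12.1(b); it is an illustration only, not the MATLAB code that accompanies the book. The function names `gbell` and `anfis_forward`, the dictionary layout of the premise parameters, and the example parameter values are all assumptions made here.

```python
def gbell(x, a, b, c):
    """Generalized bell MF in the form of Equation (12.2)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def anfis_forward(x, y, premise, consequent):
    """Forward pass of a two-input, two-rule first-order Sugeno ANFIS.

    premise:    {"A": [(a, b, c), (a, b, c)], "B": [(a, b, c), (a, b, c)]}
    consequent: [(p1, q1, r1), (p2, q2, r2)]
    Returns the overall output and the normalized firing strengths.
    """
    # Layer 1: membership grades of x in A_i and of y in B_i.
    mu_a = [gbell(x, *params) for params in premise["A"]]
    mu_b = [gbell(y, *params) for params in premise["B"]]
    # Layer 2: firing strengths, using the product as the T-norm.
    w = [mu_a[i] * mu_b[i] for i in range(2)]
    # Layer 3: normalized firing strengths.
    total = sum(w)
    w_bar = [wi / total for wi in w]
    # Layer 4: each rule's weighted first-order output.
    rule_out = [w_bar[i] * (p * x + q * y + r)
                for i, (p, q, r) in enumerate(consequent)]
    # Layer 5: overall output as the sum of incoming signals.
    return sum(rule_out), w_bar


# Hypothetical parameter values, chosen only for illustration.
example_premise = {"A": [(2.0, 2, -2.0), (2.0, 2, 2.0)],
                   "B": [(2.0, 2, -2.0), (2.0, 2, 2.0)]}
example_consequent = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
output, strengths = anfis_forward(0.0, 0.0, example_premise, example_consequent)
```

At the origin both rules fire equally in this symmetric example, so the output is the average of the two constant terms.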

The extension from Sugeno ANFIS to Tsukamoto ANFIS is straightforward, as 
shown in Figure 12.3, where the output of each rule ($f_i$, $i = 1, 2$) is induced jointly 
by a consequent membership function and a firing strength. For the Mamdani fuzzy 
inference system with max-min composition, a corresponding ANFIS can be con- 
structed if discrete approximations are used to replace the integrals in the centroid 
defuzzification scheme introduced in Section 4.2. However, the resulting ANFIS 
is much more complicated than either Sugeno ANFIS or Tsukamoto ANFIS. The 
extra complexity in structure and computation of Mamdani ANFIS with max-min 
composition does not necessarily imply better learning capability or approxima- 
tion power. If we adopt sum-product composition and centroid defuzzification for 
a Mamdani fuzzy model, a corresponding ANFIS can be constructed easily based 
on Theorem 4.1 directly without using any approximations at all. This is left as 
Exercise 4. 

Throughout this chapter, we shall concentrate on the ANFIS architectures for 
the first-order Sugeno fuzzy model because of its transparency and efficiency. 

Figure 12.4(a) is an ANFIS architecture that is equivalent to a two-input first- 
order Sugeno fuzzy model with nine rules, where each input is assumed to have 
three associated MFs. Figure 12.4(b) illustrates how the two-dimensional input 
space is partitioned into nine overlapping fuzzy regions, each of which is governed 
by a fuzzy if-then rule. In other words, the premise part of a rule defines a fuzzy 
region, while the consequent part specifies the output within the region. 
Figure 12.3. (a) A two-input two-rule Tsukamoto fuzzy model; (b) equivalent 
ANFIS architecture. 

Figure 12.4. (a) ANFIS architecture for a two-input Sugeno fuzzy model with nine 
rules; (b) the input space, which is partitioned into nine fuzzy regions. 


Next we shall demonstrate how to apply the hybrid learning algorithms devel- 
oped in Chapter 8 to identify ANFIS parameters. 





Table 12.1. Two passes in the hybrid learning procedure for ANFIS. 

                        Forward pass              Backward pass 
Premise parameters      Fixed                     Gradient descent 
Consequent parameters   Least-squares estimator   Fixed 
Signals                 Node outputs              Error signals 


12.3 HYBRID LEARNING ALGORITHM 

From the ANFIS architecture shown in Figure 12.1(b), we observe that when the 
values of the premise parameters are fixed, the overall output can be expressed as 
a linear combination of the consequent parameters. In symbols, the output f in 
Figure 12.1(b) can be rewritten as 

$$\begin{aligned}
f &= \frac{w_1}{w_1 + w_2} f_1 + \frac{w_2}{w_1 + w_2} f_2 \\
  &= \bar{w}_1 (p_1 x + q_1 y + r_1) + \bar{w}_2 (p_2 x + q_2 y + r_2) \\
  &= (\bar{w}_1 x) p_1 + (\bar{w}_1 y) q_1 + (\bar{w}_1) r_1 + (\bar{w}_2 x) p_2 + (\bar{w}_2 y) q_2 + (\bar{w}_2) r_2,
\end{aligned} \tag{12.7}$$

which is linear in the consequent parameters $p_1$, $q_1$, $r_1$, $p_2$, $q_2$, and $r_2$. From this 
observation, we have 

$S$ = set of total parameters, 

$S_1$ = set of premise (nonlinear) parameters, 

$S_2$ = set of consequent (linear) parameters 

in Equation (8.31); and $H(\cdot)$ and $F(\cdot, \cdot)$ are the identity function and the 
function of the fuzzy inference system, respectively, in Equation (8.32). Therefore, the 
hybrid learning algorithm developed in Section 8.5 can be applied directly. More 
specifically, in the forward pass of the hybrid learning algorithm, node outputs go 
forward until layer 4 and the consequent parameters are identified by the least- 
squares method. In the backward pass, the error signals propagate backward and 
the premise parameters are updated by gradient descent. Table 12.1 summarizes 
the activities in each pass. 
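
The forward-pass half of this procedure can be illustrated with a small sketch. With the premise parameters held fixed, each training pair contributes one row of regressors taken from the last line of Equation (12.7), and the six consequent parameters follow from a linear least-squares fit. The Python below is a hedged illustration under stated assumptions: it solves the batch normal equations with Gauss-Jordan elimination rather than using the estimator of Section 8.5, and the premise values, the grid, and the "true" consequent parameters are hypothetical.

```python
def gbell(x, a, b, c):
    """Generalized bell MF in the form of Equation (12.2)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def regressor(x, y, premise):
    """One design-matrix row: the coefficients of p1, q1, r1, p2, q2, r2."""
    w1 = gbell(x, *premise["A"][0]) * gbell(y, *premise["B"][0])
    w2 = gbell(x, *premise["A"][1]) * gbell(y, *premise["B"][1])
    wb1, wb2 = w1 / (w1 + w2), w2 / (w1 + w2)
    return [wb1 * x, wb1 * y, wb1, wb2 * x, wb2 * y, wb2]


def solve_normal_equations(rows, targets):
    """Least squares via (A^T A) theta = A^T b, Gauss-Jordan with pivoting."""
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T b].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical fixed premise parameters and "true" consequent parameters.
premise = {"A": [(2.5, 2, -2.0), (2.5, 2, 2.0)],
           "B": [(2.5, 2, -1.0), (2.5, 2, 3.0)]}
true_theta = [1.0, -2.0, 0.5, -1.5, 0.8, 2.0]

grid = [float(v) for v in range(-5, 6)]
rows = [regressor(x, y, premise) for x in grid for y in grid]
targets = [sum(a * t for a, t in zip(row, true_theta)) for row in rows]
theta = solve_normal_equations(rows, targets)
```

Because the targets here are generated by a model with the same premise part, the fit should recover `true_theta` up to rounding; with real data the result is the best linear fit for the current premise parameters.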

As mentioned in Section 8.5, the consequent parameters thus identified are op- 
timal under the condition that the premise parameters are fixed. Accordingly, the 
hybrid approach converges much faster since it reduces the search space dimensions 
of the original pure backpropagation method. Thus we should always look for the 
possibility of decomposing the parameter set in the first place. For Tsukamoto 
ANFIS, this can be achieved if the membership function on the consequent part 
of each rule is replaced by a piecewise linear approximation with two consequent 
parameters, as shown in Figure 12.5. In this case, again, the consequent parameters 
constitute the linear parameter set $S_2$ and the hybrid learning rule can be employed 
as before. 

Figure 12.5. Piecewise linear approximation of consequent MFs in Tsukamoto 
ANFIS. 

As discussed in Section 8.5, there are several ways of combining gradient descent 
and the least-squares method. We can choose one of these methods according to 
the available computing resources and required performance level. 

As pointed out by a reviewer of the original ANFIS paper [8], the learning 
mechanisms should not be applied to determine membership functions in Sugeno ANFIS, 
since they convey linguistic and subjective descriptions of possibly ill-defined 
concepts. We think this is a case-by-case situation and the decision should be left to 
the user. In principle, if the size of the available input-output data set is large, 
then fine-tuning of the membership functions is recommended (or even necessary), 
since human-determined membership functions are seldom optimal in terms of re- 
producing desired outputs. However, if the data set is too small, then it probably 
does not contain enough information about the target system. In this situation, 
the human-determined membership functions represent important information that 
might not be reflected in the data set; therefore, the membership functions should 
be kept fixed throughout the learning process. 

If the membership functions are fixed and only the consequent part is adjusted, 
Sugeno ANFIS can be viewed as a functional-link network [13, 30], where the “en- 
hanced representations” of the input variables are obtained via the membership 
functions. These enhanced representations determined by human experts appar- 
ently provide more insight into the target system than the functional expansion or 
the tensor (outer product) models [30]. By updating the membership functions, we 
are actually tuning this enhanced representation for better performance. 

Because the update formulas for the premise and consequent parameters are de- 
coupled in the hybrid learning rule (see Table 12.1), further speedup of learning is 
possible by using variants of the gradient method or other optimization techniques 
on the premise parameters, such as conjugate gradient descent, second-order back- 
propagation [31], quick propagation [3], and many others. See also Chapter 6 for 
other derivative-based optimization methods. 


12.4 LEARNING METHODS THAT CROSS-FERTILIZE ANFIS 
AND RBFN 

As we discussed in Section 9.5.2, under certain minor conditions, an RBFN (radial 
basis function network) is functionally equivalent to a FIS, and thus to adaptive FIS, 
including ANFIS (introduced in this chapter) and CANFIS (introduced in 
Chapter 13). This functional equivalence provides a shortcut for better understanding 
both ANFIS/CANFIS and RBFNs in the sense that developments in either 
literature cross-fertilize the other. In this section, we briefly describe a variety of adaptive 
learning mechanisms that can be used for both adaptive FIS and RBFN. 

An adaptive FIS usually consists of two distinct modifiable parts: the antecedent 
part and the consequent part. These two parts can be adapted by different opti- 
mization methods, one of which is the hybrid learning procedure combining GD 
(gradient descent) and LSE (least-squares estimator), as discussed in Section 12.3. 
Possible combinations of GD and LSE are also discussed in the same section. These 
learning schemes are equally applicable to RBFNs. 

Conversely, the analysis and learning algorithms for RBFNs are also applicable 
to adaptive FIS (ANFIS/CANFIS). The RBFN approximation capability may be 
further improved with supervised adjustments of the center and shape of receptive 
field functions [19, 44]. Besides using a supervised learning scheme alone to update 
all modifiable parameters, a variety of two-phase training algorithms for RBFNs 
have been reported. A typical scheme is to fix the receptive field (radial basis) 
functions first and then adjust the weights of the output layer. There are several 
schemes proposed to determine the center positions ($u_i$) of the receptive field 
functions. Lowe discussed selection of fixed centers based on standard deviations of 
training data [22]. Moody and Darken discussed unsupervised or self-organized 
selection of centers $u_i$ by means of vector quantization or clustering techniques [25, 26] 
(see also Chapters 11 and 15). Then the width parameters $\sigma_i$ are determined by 
taking the average distance to the first several nearest neighbors of the $u_i$'s. Nowlan [29] 
employed the so-called soft competition among Gaussian hidden units to locate the 
centers. (This soft-competitive method is based on the maximum likelihood 
estimator, in contrast to the so-called hard competition such as the k-means 
winner-take-all algorithm.) Once these nonlinear parameters are fixed and the receptive 
fields are frozen, the linear parameters (i.e., the weights of the output layer) can be 
updated by either the least-squares method or the gradient method. 
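
The width heuristic mentioned above (average distance to the nearest neighboring centers) is simple enough to sketch. The Python below is an illustrative reading of that sentence, not code from refs. [25, 26]; the function name `widths_from_neighbors`, the neighbor count `p`, and the example centers are assumptions.

```python
import math


def widths_from_neighbors(centers, p=2):
    """For each center, average the Euclidean distance to its p nearest
    other centers; the result is used as that unit's width parameter."""
    widths = []
    for i, u in enumerate(centers):
        dists = sorted(math.dist(u, v)
                       for j, v in enumerate(centers) if j != i)
        nearest = dists[:p]          # fewer than p if there are few centers
        widths.append(sum(nearest) / len(nearest))
    return widths


# Hypothetical centers, e.g. as produced by a clustering step.
example_centers = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
widths_p1 = widths_from_neighbors(example_centers, p=1)
widths_p2 = widths_from_neighbors(example_centers, p=2)
```

Once centers and widths are frozen in this way, only the output-layer weights remain to be fitted, which is a linear problem.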

Chen et al. [2] used an alternative method that employs the orthogonal 
least-squares algorithm to determine the $u_i$'s and $c_i$'s while keeping the $\sigma_i$'s at a 
predetermined constant. Other RBFN analyses, such as generalization properties [1] 
and sequential adaptation [11], among others [9, 27], are all applicable to adaptive 
FIS (ANFIS/CANFIS). 

12.5 ANFIS AS A UNIVERSAL APPROXIMATOR* 

This section explains an interesting property: when the number of rules is 
not restricted, a zero-order Sugeno model has unlimited approximation power for 
matching any nonlinear function arbitrarily well on a compact set. This fact is 
intuitively reasonable. However, to give a mathematical proof, we need to apply 
the Stone-Weierstrass theorem [12, 32]. 


Theorem 12.1 Stone-Weierstrass theorem 

Let domain D be a compact space of N dimensions, and let $\mathcal{F}$ be a set of continuous 
real-valued functions on D satisfying the following criteria: 

1. Identity function: The constant f(x) = 1 is in $\mathcal{F}$. 

2. Separability: For any two points $x_1 \neq x_2$ in D, there is an f in $\mathcal{F}$ such that 
$f(x_1) \neq f(x_2)$. 

3. Algebraic closure: If f and g are any two functions in $\mathcal{F}$, then fg and 
af + bg are in $\mathcal{F}$ for any two real numbers a and b. 

Then $\mathcal{F}$ is dense in C(D), the set of continuous real-valued functions on D. In other 
words, for any $\epsilon > 0$ and any function g in C(D), there is a function f in $\mathcal{F}$ such 
that $|g(x) - f(x)| < \epsilon$ for all $x \in D$. 

□ 

In applications of fuzzy inference systems, the domain in which we operate is 
almost always compact. It is a standard result in real analysis that every closed 
and bounded set in $R^N$ is compact. In what follows, we shall describe how to apply 
the Stone-Weierstrass theorem to show the universal approximation power of the 
zero-order Sugeno model. 

Identity Function 

The first hypothesis of the Stone-Weierstrass theorem requires that our fuzzy 
inference system be able to compute the identity function f(x) = 1. An obvious 
way to compute this function is to set the consequence part of each rule equal to 1. 
A fuzzy inference system with only one rule suffices to satisfy this requirement. 

Separability 

The second hypothesis of the Stone-Weierstrass theorem requires that our fuzzy 
inference system be able to compute functions that have different values for different 
points. [Without this requirement, the trivial set of functions $\{f : f(x) = c, \; c \in R\}$ 
would satisfy the Stone-Weierstrass theorem.] Again, this is obviously achievable 
by any fuzzy inference system with appropriate parameters. 

Algebraic Closure — Additive 

The third hypothesis of the Stone-Weierstrass theorem requires that our fuzzy 
inference systems be invariant under addition and multiplication. Suppose that we 
have two fuzzy inference systems $S$ and $\hat{S}$; each of them has two rules, and the final 
output of each system is specified as 

$$S: \; z = \frac{w_1 f_1 + w_2 f_2}{w_1 + w_2} \tag{12.8}$$

and 

$$\hat{S}: \; \hat{z} = \frac{\hat{w}_1 \hat{f}_1 + \hat{w}_2 \hat{f}_2}{\hat{w}_1 + \hat{w}_2}.$$

Then the weighted sum of $z$ and $\hat{z}$ is equal to 

$$\begin{aligned}
a z + b \hat{z} &= a\, \frac{w_1 f_1 + w_2 f_2}{w_1 + w_2} + b\, \frac{\hat{w}_1 \hat{f}_1 + \hat{w}_2 \hat{f}_2}{\hat{w}_1 + \hat{w}_2} \\
&= \frac{w_1 \hat{w}_1 (a f_1 + b \hat{f}_1) + w_1 \hat{w}_2 (a f_1 + b \hat{f}_2) + w_2 \hat{w}_1 (a f_2 + b \hat{f}_1) + w_2 \hat{w}_2 (a f_2 + b \hat{f}_2)}{w_1 \hat{w}_1 + w_1 \hat{w}_2 + w_2 \hat{w}_1 + w_2 \hat{w}_2}.
\end{aligned} \tag{12.9}$$

Thus we can construct a four-rule fuzzy inference system that computes $a z + b \hat{z}$, 
where the firing strength and the output of each rule are defined by $w_i \hat{w}_j$ and 
$a f_i + b \hat{f}_j$ ($i, j = 1$ or $2$), respectively. 

Algebraic Closure — Multiplicative 

Invariance under multiplication is the final feature we must demonstrate 
before we can conclude that the Stone-Weierstrass theorem can be applied to the 
zero-order Sugeno fuzzy model. The product of the outputs $z$ and $\hat{z}$ of the two 
fuzzy inference systems can be expressed as 

$$z \hat{z} = \frac{w_1 \hat{w}_1 f_1 \hat{f}_1 + w_1 \hat{w}_2 f_1 \hat{f}_2 + w_2 \hat{w}_1 f_2 \hat{f}_1 + w_2 \hat{w}_2 f_2 \hat{f}_2}{w_1 \hat{w}_1 + w_1 \hat{w}_2 + w_2 \hat{w}_1 + w_2 \hat{w}_2}. \tag{12.10}$$

Thus we can construct a four-rule fuzzy inference system that computes $z \hat{z}$, where 
the firing strength and the output of each rule are defined by $w_i \hat{w}_j$ and $f_i \hat{f}_j$ 
($i, j = 1$ or $2$), respectively. 
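
Both closure constructions can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original proof: it uses single-input zero-order Sugeno systems whose membership functions are scaled Gaussians of the form k·exp(−((x − c)/a)²), relies on the fact that the product of two such functions is again of that form, and builds the four-rule systems explicitly. All names (`gauss`, `sugeno0`, `product_mf`, `combine`) and all parameter values are hypothetical.

```python
import math


def gauss(x, k, a, c):
    """Scaled Gaussian MF: k * exp(-((x - c) / a)^2)."""
    return k * math.exp(-((x - c) / a) ** 2)


def sugeno0(x, rules):
    """Zero-order Sugeno output; rules is a list of ((k, a, c), f)."""
    w = [gauss(x, *mf) for mf, _ in rules]
    return sum(wi * f for wi, (_, f) in zip(w, rules)) / sum(w)


def product_mf(mf1, mf2):
    """Parameters (k, a, c) of the product of two scaled Gaussians."""
    k1, a1, c1 = mf1
    k2, a2, c2 = mf2
    inv_a2 = 1.0 / a1 ** 2 + 1.0 / a2 ** 2
    a = inv_a2 ** -0.5
    c = (c1 / a1 ** 2 + c2 / a2 ** 2) / inv_a2
    k = k1 * k2 * math.exp(-(c1 ** 2 / a1 ** 2 + c2 ** 2 / a2 ** 2
                             - c ** 2 * inv_a2))
    return (k, a, c)


def combine(rules_s, rules_t, out):
    """Four-rule system: strengths w_i * w^_j, outputs out(f_i, f^_j)."""
    return [(product_mf(m1, m2), out(f1, f2))
            for m1, f1 in rules_s for m2, f2 in rules_t]


# Two hypothetical two-rule systems S and S^, and weights a, b.
S = [((1.0, 1.0, -1.0), 2.0), ((1.0, 2.0, 1.0), -1.0)]
T = [((0.5, 1.5, 0.0), 3.0), ((1.0, 1.0, 2.0), 0.5)]
a, b = 2.0, -3.0
sum_system = combine(S, T, lambda f, g: a * f + b * g)
prod_system = combine(S, T, lambda f, g: f * g)
```

Evaluating `sugeno0(x, sum_system)` and `sugeno0(x, prod_system)` at any point should agree with `a * z + b * z_hat` and `z * z_hat` computed from the two original systems.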

From the preceding description, we conclude that the ANFIS architectures that 
compute $a z + b \hat{z}$ and $z \hat{z}$ are of the same class as those of $S$ and $\hat{S}$ if and only if the 
membership functions used are invariant under multiplication. One class of MFs 
that satisfy this property is the scaled Gaussian membership function [39, 41]: 

$$\mu_{A_i}(x) = k_i \exp\left[ -\left( \frac{x - c_i}{a_i} \right)^2 \right]. \tag{12.11}$$

Another class of MFs that are invariant under the product operator is MFs for crisp 
sets, which assume values of either 0 or 1. MFs of this kind can be viewed as a 
special case of either the generalized bell MF with parameter b approaching $\infty$, or 
the trapezoidal MF with a = b and c = d (see Section 2.4.1). 

Therefore, with an appropriate class of membership functions, a zero-order 
Sugeno model can satisfy the four criteria of the Stone-Weierstrass theorem. That 
is, for any given $\epsilon > 0$ and any continuous real-valued function g, there is a zero-order Sugeno 
model S such that $|g(x) - S(x)| < \epsilon$ for all x in the underlying compact set. The 
preceding argument of universal approximation power applies to other types of fuzzy 
models as well, since the zero-order Sugeno model is a special case of the Mamdani 
fuzzy model, the Tsukamoto fuzzy model, and other higher-order Sugeno models. 




Figure 12.6. A typical initial MF setting, where the input range is assumed to be 
[0, 12]. Horizontal axis: input variable. (MATLAB file: init_mf.m) 


However, caution should be taken in accepting this claim, since there has been 
no mention of how to construct the Sugeno model according to a given training 
data set; the Stone-Weierstrass theorem yields only an existence result, not 
a constructive method. 

12.6 SIMULATION EXAMPLES 

This section presents simulation results of the ANFIS architecture for the Sugeno 
fuzzy model (see Figures 12.1 and 12.4). In the first two examples, ANFIS is used 
to model two nonlinear functions; the results are compared with those achieved 
by backpropagation MLP approaches (see Section 9.4) and other earlier work on 
fuzzy modeling. In the third example, we use ANFIS for on-line identification of a 
nonlinear component in a discrete control system. In the last example, we predict a 
chaotic time series using ANFIS and demonstrate its superiority to several standard 
statistical and neural network approaches. The purpose of these examples is to give 
a detailed description of how to use ANFIS and how it performs; more ANFIS 
applications for a number of different domains are given in Chapter 19. 

12.6.1 Practical Considerations 

In a conventional fuzzy inference system, the number of rules is determined by an 
expert who is familiar with the target system to be modeled. In our simulation, 
however, no expert is available and the number of MFs assigned to each input 
variable is chosen empirically — that is, by plotting the data sets and examining them 
visually, or simply by trial and error. For data sets with more than three inputs, 
visualization techniques are not very effective and most of the time we have to rely 
on trial and error. This situation is similar to that of neural networks; there is just 
no simple way to determine in advance the minimal number of hidden units needed 
to achieve a desired performance level. (There are several other techniques for 
determining the numbers of MFs and rules, such as CART and clustering methods. 
But here we shall not use them since the purpose of this section is to demonstrate 
the learning capability of ANFIS.) 

If we choose the grid partition in Figure 4.13(a) (see page 87), the number of 
MFs on each input variable uniquely determines the number of rules. The initial 
values of premise parameters are set in such a way that the centers of the MFs are 
equally spaced along the range of each input variable. Moreover, these MFs satisfy 
the condition of $\epsilon$-completeness [17, 18] with $\epsilon = 0.5$, which means that given a 
value x of one of the inputs in the operating range, we can always find a linguistic 
label A such that $\mu_A(x) \ge \epsilon$. In this manner, the fuzzy inference system can 
provide smooth transition and sufficient overlap from one linguistic label to another. 
Although we did not attempt to maintain the $\epsilon$-completeness during the training 
process, it can be easily achieved by using a constrained gradient method [48]. 
Figure 12.6 shows a typical initial MF setting when the number of MFs is four and 
the input range is [0, 12]. Throughout the simulation examples presented in this 
section, all the membership functions used are the generalized bell function defined 
in Equation (12.2): 

$$\mu_A(x) = \text{gbell}(x; a, b, c) = \frac{1}{1 + \left| \dfrac{x - c}{a} \right|^{2b}}, \tag{12.12}$$

which contains three fitting parameters, a, b, and c. Each of these parameters has 
a physical meaning: c determines the center of the MF; a is half the width of the 
MF; and b (together with a) controls the slopes at the crossover points (where the 
MF value is 0.5). Figure 2.9 in Chapter 2 illustrates these concepts. 
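
The initial setting described above can be sketched in a few lines. The Python below is a hedged illustration, not a transcription of init_mf.m: it assumes that each MF takes a equal to half the spacing between neighboring centers, so that neighbors cross at exactly 0.5, and that b = 2; both choices, and the function name `init_mfs`, are assumptions made here.

```python
def gbell(x, a, b, c):
    """Generalized bell MF in the form of Equation (12.12)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def init_mfs(lo, hi, n_mf, b=2.0):
    """Equally spaced centers on [lo, hi]; a is half the center spacing,
    so adjacent MFs intersect at their crossover points (value 0.5)."""
    spacing = (hi - lo) / (n_mf - 1)
    return [(spacing / 2.0, b, lo + i * spacing) for i in range(n_mf)]


# Four MFs on the input range [0, 12], as in the setting of Figure 12.6.
mfs = init_mfs(0.0, 12.0, 4)
centers = [c for _, _, c in mfs]

# Smallest "best" membership grade over a fine grid of the input range;
# epsilon-completeness with epsilon = 0.5 requires this to be at least 0.5.
xs = [0.05 * i for i in range(241)]
coverage = min(max(gbell(x, *mf) for mf in mfs) for x in xs)
```

The crossover points of each MF sit at c ± a, which is why a is described as half the width.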

We mentioned that the step size k in Equation (6.61) may influence the speed 
of convergence. In the simulation reported next, we use two heuristic guidelines to 
update the step size k adaptively. These two guidelines (see also Figure 6.18) and 
the general observations that lead to them are detailed in Section 6.7.2 of Chapter 6. 


12.6.2 Example 1: Modeling a Two-Input Sinc Function 


In this example, we use ANFIS to model a two-dimensional sinc equation defined 
by 

$$z = \text{sinc}(x, y) = \frac{\sin(x) \sin(y)}{x y}. \tag{12.13}$$

From the evenly distributed grid points of the input range [−10, 10] × [−10, 10] of 
the preceding equation, 121 training data pairs were obtained. The ANFIS used 
here contains 16 rules, with four membership functions assigned to each input 
variable. The total number of fitting parameters is 72, including 24 premise (nonlinear) 
parameters and 48 consequent (linear) parameters. (We also tried ANFIS models 
with four rules and nine rules, but these models are too simple to describe the highly 
nonlinear sinc function.) 
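
The data set and the parameter counts quoted above can be reproduced with a short sketch. The Python below assumes an 11 × 11 grid with spacing 2 that includes the end points (the text states only that 121 evenly distributed grid points were used) and takes the limiting value 1 for sin(t)/t at t = 0; the names are hypothetical.

```python
import math


def sinc2(x, y):
    """Two-input sinc function; sin(t)/t is taken as 1 at t = 0."""
    sx = math.sin(x) / x if x != 0.0 else 1.0
    sy = math.sin(y) / y if y != 0.0 else 1.0
    return sx * sy


# Assumed grid: 11 evenly spaced points per axis on [-10, 10].
grid = [-10.0 + 2.0 * i for i in range(11)]
training_data = [((x, y), sinc2(x, y)) for x in grid for y in grid]

# Parameter bookkeeping for a grid-partitioned first-order Sugeno ANFIS.
n_inputs, mfs_per_input, params_per_mf = 2, 4, 3   # gbell has a, b, c
n_rules = mfs_per_input ** n_inputs                # one rule per grid cell
n_premise = n_inputs * mfs_per_input * params_per_mf
n_consequent = n_rules * (n_inputs + 1)            # p, q, r for each rule
n_total = n_premise + n_consequent
```

The same bookkeeping gives 4 and 9 rules for two and three MFs per input, the smaller models mentioned in the text.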

Figure 12.7 shows the RMSE (root mean squared error) curves for both a 
2-18-1 backpropagation MLP and the ANFIS architecture used here. Each curve is 
the result of averaging 10 error curves from 10 runs. For the MLP, these 10 runs 
were started from different sets of initial random weights. For ANFIS, these 10 runs 
correspond to 10 values of the step size k ranging from 0.01 to 0.10. 

Figure 12.7. RMSE curves for the MLP and ANFIS. 

The backpropagation MLP, which contained 73 fitting parameters (connection 
weights and thresholds), was trained with quick propagation [3], which is consid- 
ered one of the best learning algorithms for backpropagation MLPs. Figure 12.7 
shows how ANFIS approximates a highly nonlinear surface more effectively than 
an MLP. It should be emphasized that for the same number of epochs (250 in 
Figure 12.7), the ANFIS model did take longer since the hybrid learning rule involves 
more computation. However, even if we increased the number of training epochs for the MLP, 
its performance would stay the same, since its error curve levels off after 100 epochs, as 
shown in Figure 12.7. The poor performance of MLPs seems due to their 
structure: The learning processes could become trapped in local minima because of 
the randomly initialized weights, or some neurons could be pushed into saturation 
during the training. Either of these two situations can significantly decrease the 
approximation power of MLPs. 

The training data and reconstructed surfaces at different epochs during training 
are depicted in Figure 12.8. Since the error measure is always computed after a for- 
ward pass (that is, the first half of a whole epoch) is completed, the epoch numbers 
shown in the caption of Figure 12.8 always end with .5. (In our later descriptions of 
ANFIS applications in Chapter 19, we will round these epoch numbers to the next 
integers for simplicity.) Note that the reconstructed surface after 0.5 epochs is the 
result after identifying consequent parameters using LSE for the first time; yet it 
already looks similar to the training data surface. 

Figure 12.9 shows the initial and final membership functions. It is interesting 
to observe that the sharp changes in the training data surface around the origin 
are accounted for by the membership functions moving toward the origin. 
Theoretically, the final MFs on both x and y should be symmetric with respect to the 
origin. However, they are not symmetric, due to computer truncation errors and the 
approximated initial conditions used for bootstrapping the recursive least-squares 
estimator. 

Figure 12.8. Training data (upper left) and reconstructed surfaces at 0.5 (upper 
right), 99.5 (lower left), and 249.5 (lower right) epochs in Example 1. (MATLAB 
file: trn_2in.m) 

12.6.3 Example 2: Modeling a Three-Input Nonlinear Function 

Figure 12.9. Initial and final MFs in Example 1. (MATLAB file: trn_2in.m) 

Figure 12.10. The ANFIS model for Example 2; the output node is labeled 
"predicted output." (The connections from inputs to layer 4 are not shown.) 

The training data in this example were obtained from a three-input nonlinear 
equation defined by 

$$\text{output} = (1 + x^{0.5} + y^{-1} + z^{-1.5})^2. \tag{12.14}$$

This equation was also used by Takagi and Hayashi [36], Sugeno and Kang [34], and 
Kondo [14] to test their modeling approaches. Here the ANFIS architecture (see 
Figure 12.10) contains eight rules, with two membership functions assigned to each 
input variable. A total of 216 training data and 125 checking data were sampled 
uniformly from the input ranges [1, 6] × [1, 6] × [1, 6] and [1.5, 5.5] × [1.5, 5.5] × [1.5, 5.5], 
respectively. The training data were used for training ANFIS, while the checking 
data were used for verifying the identified ANFIS only. To allow comparison, we 
use the same performance index adopted in refs. [34, 14]: 

$$\text{APE} = \text{average percentage error} = \frac{1}{P} \sum_{i=1}^{P} \frac{|T(i) - O(i)|}{|T(i)|} \cdot 100\%, \tag{12.15}$$

where P is the number of data pairs, and T(i) and O(i) are the ith desired output 
and predicted output, respectively. 
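
As a worked illustration of Equations (12.14) and (12.15), the following Python sketch generates the two data sets and evaluates the APE on a toy pair of lists. It assumes that "sampled uniformly" means a regular grid with unit spacing (6 points per axis for training, 5 per axis for checking), which matches the stated sizes of 216 and 125; the function names are hypothetical.

```python
def target(x, y, z):
    """Three-input nonlinear function of Equation (12.14)."""
    return (1.0 + x ** 0.5 + y ** -1 + z ** -1.5) ** 2


def ape(targets, outputs):
    """Average percentage error of Equation (12.15), in percent."""
    p = len(targets)
    return 100.0 * sum(abs(t - o) / abs(t)
                       for t, o in zip(targets, outputs)) / p


# Assumed regular grids with unit spacing on [1, 6]^3 and [1.5, 5.5]^3.
trn_axis = [1.0 + i for i in range(6)]
chk_axis = [1.5 + i for i in range(5)]
trn_data = [((x, y, z), target(x, y, z))
            for x in trn_axis for y in trn_axis for z in trn_axis]
chk_data = [((x, y, z), target(x, y, z))
            for x in chk_axis for y in chk_axis for z in chk_axis]

# Parameter count for three inputs with two gbell MFs each and eight rules.
n_params = 3 * 2 * 3 + 8 * (3 + 1)

# Toy check of the index: relative errors of 10% and 10% average to 10%.
toy_ape = ape([1.0, 2.0], [1.1, 1.8])
```

The parameter count of 50 agrees with the value listed for ANFIS in Table 12.2.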

Figure 12.11 illustrates the membership functions before and after training. The 
training error curves with different initial step sizes (k = 0.01 to 0.09) are shown in 
Figure 12.12(a), which indicates that the initial value of k does not have a critical 
influence on the final performance as long as k is not too large. Figure 12.12(b) 
shows the training and checking error curves with initial step size equal to 0.1. After 
199.5 epochs, the final results were $\text{APE}_{\text{trn}} = 0.043\%$ and $\text{APE}_{\text{chk}} = 1.066\%$, which 
are listed in Table 12.2 along with the results of other earlier work [14, 34]. Here 
ANFIS achieves the best performance at the cost of requiring more training data. 

Figure 12.11. Example 2: (a) MFs before learning; (b), (c), (d) MFs after 
learning. (MATLAB file: trn_3in.m) 

Figure 12.12. Error curves of Example 2: (a) nine training error curves for 
nine initial step sizes from 0.01 (rightmost) to 0.09 (leftmost); (b) training (solid 
line) and checking (dashed line) error curves with initial step size equal to 0.1. 
(MATLAB file: trn_3in.m) 
Table 12.2. Example 2: comparisons with earlier work. The last three rows are 
from ref. [34]. 

Model                Training   Checking   Param.   Training    Checking 
                     error      error      number   data size   data size 
ANFIS                0.043%     1.066%     50       216         125 
GMDH model [14]      4.7%       5.7%       -        20 
Fuzzy model 1 [34]   1.5%       2.1%       22       20          20 
Fuzzy model 2 [34]   0.59%      3.4%       32       20          20 


12.6.4 Example 3: On-line Identification in Control Systems 

Here we repeat a simulation example from ref. [28], where a 1-20-10-1 backprop- 
agation MLP is employed to identify a nonlinear component in a control system, 
except that here we use ANFIS instead to show its superiority. The plant under 
consideration is governed by the following difference equation: 

y(k + 1) = 0.3 y(k) + 0.6 y(k - 1) + f(u(k)), (12.16)

where y(k) and u(k) are the output and input, respectively, at time step k. The
unknown function f(·) has the form

f(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu). (12.17)

In order to identify the plant, a series-parallel model governed by the difference
equation

y(k + 1) = 0.3 y(k) + 0.6 y(k - 1) + F(u(k)) (12.18)

was used, where F(·) is the function implemented by ANFIS and its parameters
are updated at each time step. Here the ANFIS architecture has seven MFs on its
input (thus seven rules with 35 fitting parameters) and the on-line learning paradigm
adopted has a learning rate η = 0.1 and a forgetting factor λ = 0.99. The input to
the plant and the model was a sinusoid u(k) = sin(2πk/250); the adaptation started
at k = 1 and stopped at k = 250. As shown in Figure 12.13, the output of the model
follows the output of the plant almost immediately, even after the adaptation was
stopped at k = 250 and u(k) was changed to 0.5 sin(2πk/250) + 0.5 sin(2πk/25)
at k = 500. In comparison, the MLP in ref. [28] failed to follow the plant when
the adaptation was stopped at k = 500, and the identification procedure had to
continue for 50,000 time steps using a random input. Table 12.3 summarizes the
comparison.
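
The plant and the series-parallel model above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the zero initial conditions y(0) = y(1) = 0 are an assumption (the text does not state them), and F is a placeholder for whatever mapping ANFIS has learned.

```python
import math

def f(u):
    """Nonlinearity of Equation (12.17)."""
    return (0.6 * math.sin(math.pi * u)
            + 0.3 * math.sin(3 * math.pi * u)
            + 0.1 * math.sin(5 * math.pi * u))

def plant_input(k):
    """Single sinusoid before k = 500, two-tone signal from k = 500 on."""
    if k < 500:
        return math.sin(2 * math.pi * k / 250)
    return 0.5 * math.sin(2 * math.pi * k / 250) + 0.5 * math.sin(2 * math.pi * k / 25)

def simulate_plant(n_steps, y0=0.0, y1=0.0):
    """Iterate Equation (12.16); returns [y(0), y(1), ..., y(n_steps)].

    The initial values y0 and y1 are assumed, not taken from the text.
    """
    y = [y0, y1]
    for k in range(1, n_steps):
        y.append(0.3 * y[k] + 0.6 * y[k - 1] + f(plant_input(k)))
    return y

def series_parallel_step(y_k, y_km1, u_k, F):
    """One step of Equation (12.18); F stands in for the ANFIS mapping."""
    return 0.3 * y_k + 0.6 * y_km1 + F(u_k)
```

If F reproduces f exactly, the model step returns the plant output, which is the situation the adaptation is trying to reach.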

In the preceding simulation, the number of rules is determined by trial and 
error. If the number of MFs is below seven, then the model output will not follow 
the plant output satisfactorily after 250 adaptations. By using the more effective off- 
line learning, we can decrease the number of rules. Figures 12.14, 12.15, and 12.16 

Figure 12.13. Example 3: (a) u(k); (b) f(u(k)) and F(u(k)); (c) plant and model
outputs.


Table 12.3. Example 3: comparison with MLP identifier [28].

Method   Parameter number   Time steps of adaptation
MLP      261                50000
ANFIS    35                 250

show the results after 49.5 epochs of off-line learning when the number of MFs is
5, 4, and 3, respectively. From these figures, it is obvious that ANFIS is a good
model even when as few as three MFs are used. However, as the number of rules
becomes smaller, the relationship between F(u) and each rule’s output becomes
less clear, in the sense that it is harder to sketch F(u) from each rule’s output. In

[Figure panels: Initial MFs; f(u) and ANFIS Outputs; Final MFs; Each Rule's Outputs]

Figure 12.14. Example 3: off-line learning with five MFs. (MATLAB command:
trn_1in(5))


other words, when the number of parameters is reduced moderately, ANFIS usually 
still does a satisfactory job, but at the cost of sacrificing its semantics in terms of 
the local-description nature of fuzzy if-then rules. In this case, ANFIS is less of 
a structured knowledge representation and more like a black-box model, such as a 
backpropagation MLP. 

12.6.5 Example 4: Predicting Chaotic Time Series 

Examples 1, 2, and 3 demonstrate the capability of ANFIS for modeling nonlinear 
functions. In this example, we demonstrate how ANFIS can be employed to predict 
future values of a chaotic time series. The performance obtained in this example is 
compared with the results of a cascade-correlation neural network approach [7] and 
the conventional auto-regressive (AR) model. 

The time series used in our simulation is generated by the chaotic Mackey-Glass 
differential delay equation [23] defined as 

    \dot{x}(t) = \frac{0.2\, x(t - \tau)}{1 + x^{10}(t - \tau)} - 0.1\, x(t). \qquad (12.19)

The prediction of future values of this time series is a benchmark problem that has 
been used and reported by a number of connectionist researchers, such as Lapedes 

Figure 12.15. Example 3: off-line learning with four MFs. (MATLAB command:
trn_1in(4))


and Farber [16], Moody [24, 26], Jones et al. [9], Crowder [7], and Sanger [33]. 

The goal of the task is to use past values of the time series up to time t to
predict the value at some point in the future t + P. The standard method for this
type of prediction is to create a mapping from D points of the time series spaced
Δ apart, that is, [x(t − (D − 1)Δ), ..., x(t − Δ), x(t)], to a predicted future
value x(t + P). To allow comparison with earlier work (Lapedes and Farber [16],
Moody [24, 26], Crowder [7]), the values D = 4 and Δ = P = 6 were used. All
other simulation settings were arranged to be as close as possible to those reported
in ref. [7].

To obtain the time series value at each integer time point, we applied the fourth-
order Runge-Kutta method to find the numerical solution to Equation (12.19). The
time step used in the method was 0.1, the initial condition was x(0) = 1.2, and τ = 17.
In this way, x(t) was obtained via numerical integration for 0 ≤ t ≤ 2000. [We
assume x(t) = 0 for t < 0 in the integration.] From the Mackey-Glass time series
x(t), we extracted 1000 input-output data pairs of the following format:

[x(t − 18), x(t − 12), x(t − 6), x(t); x(t + 6)], (12.20)

where t = 118 to 1117. The first 500 pairs were used as the training data set for
ANFIS, while the remaining 500 pairs were the checking data set for validating

Figure 12.16. Example 3: off-line learning with three MFs. (MATLAB command:
trn_1in(3))


the identified ANFIS. The number of MFs assigned to each input of the ANFIS 
was set to two, so the number of rules is 16. Figure 12.17(a) depicts the initial 
membership functions for each input variable. The ANFIS used here contains a 
total of 104 fitting parameters, of which 24 are premise (nonlinear) parameters and 
80 are consequent (linear) parameters.
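
The data preparation described above can be sketched as follows. This is a minimal sketch rather than the authors' program: the function names are hypothetical, and the delayed value x(t − τ) is held constant across the four Runge-Kutta stages of each step, which is a simplifying assumption about how the delay term is handled.

```python
def mackey_glass(t_end, tau=17.0, h=0.1, x0=1.2):
    """Integrate Equation (12.19) and return x at integer times 0..t_end."""
    delay = int(round(tau / h))       # delay expressed in integration steps
    per_unit = int(round(1.0 / h))    # integration steps per unit of time
    n = t_end * per_unit
    x = [0.0] * (n + 1)
    x[0] = x0

    def deriv(x_now, x_delayed):
        return 0.2 * x_delayed / (1.0 + x_delayed ** 10) - 0.1 * x_now

    for i in range(n):
        # x(t) = 0 for t < 0, as assumed in the text.
        xd = x[i - delay] if i >= delay else 0.0
        # Assumption: xd is frozen over the step (no interpolation of the delay).
        k1 = deriv(x[i], xd)
        k2 = deriv(x[i] + 0.5 * h * k1, xd)
        k3 = deriv(x[i] + 0.5 * h * k2, xd)
        k4 = deriv(x[i] + h * k3, xd)
        x[i + 1] = x[i] + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return x[::per_unit]

def make_pairs(series, t_first=118, t_last=1117, lag=6, n_inputs=4, horizon=6):
    """Build ([x(t-18), x(t-12), x(t-6), x(t)], x(t+6)) pairs as in (12.20)."""
    pairs = []
    for t in range(t_first, t_last + 1):
        inputs = [series[t - (n_inputs - 1 - j) * lag] for j in range(n_inputs)]
        pairs.append((inputs, series[t + horizon]))
    return pairs

series = mackey_glass(1200)
pairs = make_pairs(series)
training, checking = pairs[:500], pairs[500:]
```

Integrating only to t = 1200 is enough here because the last target needed is x(1123); a different treatment of the delay term would change the numerical values slightly.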

After 499.5 epochs, we had RMSE_trn = 0.0016 and RMSE_chk = 0.0015, which
are much better than the results of the other approaches, as will be explained later.
The desired and predicted values for both training data and checking data are
essentially the same in Figure 12.18(a); the differences between them can only be
seen on a much finer scale, as shown in Figure 12.18(b). Figure 12.17(b) shows the
final membership functions; Figure 12.19 shows the RMSE curves, which indicate
that most of the learning was done in the first 100 epochs. It is unusual to observe
that RMSE_trn > RMSE_chk during the training process, as is the case here. (If
we swap the roles of the training and checking data sets, then we have the usual
situation in which RMSE_trn < RMSE_chk during the learning process.) Since
both RMSEs are very small, we conjecture that (1) the ANFIS used here has
captured the essential components of the underlying dynamics; and (2) the training
data contain the effects of the initial conditions [remember that we set x(t) = 0 for
t < 0 in the integration], which might not be easily accounted for by the essential

Figure 12.17. Membership functions in chaotic time series prediction: (a) initial
MFs for all four inputs; (b) MFs after learning. (MATLAB file: trn_4in.m)


components identified by ANFIS. 

As a comparison, we performed the same prediction using the AR model with 
the same number of parameters: 


x(t + 6) = a_0 + a_1 x(t) + a_2 x(t − 6) + ... + a_103 x(t − 102 × 6), (12.21)

where there are 104 fitting parameters a_k, k = 0 to 103. From t = 712 to 1711, we
extracted 1000 data pairs, of which the first 500 were used to identify a_k and the
remainder were used for checking. The results obtained through the standard least-
squares method were RMSE_trn = 0.005 and RMSE_chk = 0.078, which are much
worse. Figure 12.20 shows the predicted values and the prediction errors. Obviously,
the over-parameterization of the AR model causes over-fitting in the training data, 
which produces large errors in the checking data. To search for an appropriate AR 
model in terms of the best generalization capability, we tried different AR models 
with the number of parameters varying from 2 to 104. Figure 12.21 displays the 
results; the AR model with the best generalization capability is obtained when the 
number of parameters is 45. Using this AR model, we repeated the generalization 
test. Figure 12.22 shows the results; in this case, there is no over-fitting, at the 
price of larger training errors. 
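
The AR identification above amounts to an ordinary least-squares fit. The following sketch shows the idea with hypothetical helper names; it solves the normal equations by Gaussian elimination, which is an assumption made to keep the example dependency-free. For a model as large as the 104-parameter one, a QR- or SVD-based solver would be the numerically safer choice.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ar_rows(series, t_values, n_params, lag=6, horizon=6):
    """Regressors [1, x(t), x(t-lag), ...] and targets x(t+horizon), as in (12.21)."""
    rows, targets = [], []
    for t in t_values:
        rows.append([1.0] + [series[t - j * lag] for j in range(n_params - 1)])
        targets.append(series[t + horizon])
    return rows, targets

def fit_ar(rows, targets):
    """Least-squares coefficients from the normal equations (X^T X) a = X^T y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve_linear(xtx, xty)

def rmse(rows, targets, coeffs):
    """Root mean square error of the AR prediction on the given data."""
    errs = [sum(c * v for c, v in zip(coeffs, r)) - y for r, y in zip(rows, targets)]
    return (sum(e * e for e in errs) / len(errs)) ** 0.5
```

Fitting on the first 500 pairs and evaluating `rmse` on the remaining 500, for each model size from 2 to 104, would reproduce the kind of sweep described in the text.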

The nonlinear ANFIS obviously outperforms the linear AR model. However, 
identification of the AR model took only a few seconds, while the ANFIS simulation 
took about 1.5 hours on an HP (Hewlett-Packard) Apollo 700 Series workstation. 

Table 12.4 lists the generalization capabilities of other methods, which were 
measured by using each method to predict 500 points immediately following the 
training set. Here the non-dimensional error index (NDEI) [16, 7] is defined as 

Figure 12.18. Example 4: (a) Mackey-Glass time series from t = 124 to 1123 and
six-step-ahead prediction (which is indistinguishable from the time series here); (b)
prediction error. (MATLAB file: trn_4in.m)





Figure 12.19. Training (solid line) and checking (dashed line) RMSE curves for 
ANFIS modeling. (MATLAB file: trn_4in.m) 


the root mean square error divided by the standard deviation of the target series. 
(Note that the average relative variance used in refs. [42, 43] is equal to the 
square of NDEI.) The remarkable generalization capability of ANFIS, we believe, 
is derived from the following facts: 

Figure 12.20. (a) Mackey-Glass time series (solid line) from t = 718 to 1717
and six-step-ahead prediction (dashed line) by the AR model with 104 parameters; (b)
prediction errors. (The first 500 data points are training data, while the remaining
are for validation.)



Figure 12.21. Training (solid line) and checking (dashed line) errors of AR models
with numbers of parameters varying from 2 to 104.


• ANFIS can achieve a highly nonlinear mapping, as shown in Examples 1, 2, 
and 3. Therefore, it is superior to common linear methods in reproducing 
nonlinear time series. 

• The ANFIS used here has 104 adjustable parameters, far fewer than those 
used in the cascade-correlation NN (693, the median) and backpropagation 

Figure 12.22. Example 4: (a) Mackey-Glass time series (solid line) from t = 364
to 1363, and six-step-ahead prediction (dashed line) by the best AR model with 45
parameters; (b) prediction errors.


MLP (about 540) listed in Table 12.4. 

• Although not based on a priori knowledge, the initial parameters of ANFIS are
intuitively reasonable and the whole input space is covered properly; this results
in fast convergence to good parameter values that capture the underlying
dynamics.

• ANFIS consists of fuzzy rules which are local mappings (called local
experts in ref. [10]) instead of global ones. These local mappings facilitate
the minimal disturbance principle [45], which states that the adaptation
should not only reduce the output error for the current training pattern but
also minimize disturbance to responses already learned. This is particularly
important in on-line learning. We also found that the use of the least-squares
method to determine the output of each local mapping is of particular importance.
Without using LSE, the learning time would have been 5 to 10 times
longer.
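
The non-dimensional error index defined before this list can be computed as in the sketch below. The function name is hypothetical, and the use of the population standard deviation (dividing by n rather than n − 1) is an assumption, since the text does not say which is meant.

```python
import math

def ndei(targets, predictions):
    """Root mean square error divided by the standard deviation of the targets."""
    n = len(targets)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n)
    mean = sum(targets) / n
    # Population standard deviation (an assumption; see the lead-in).
    std = math.sqrt(sum((t - mean) ** 2 for t in targets) / n)
    return rmse / std
```

With this convention, a predictor that always outputs the mean of the target series has an NDEI of exactly 1, which gives a useful reference point for reading Tables 12.4 and 12.5.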

Table 12.5 lists the results of the more challenging generalization test, in which 
P is 84 and 85 for rows 1 through 6 and 7 through 10, respectively. The results of 
the first six rows were obtained by iterating the prediction of P = 6 until P = 84. 
ANFIS still outperformed these statistical and connectionist approaches in all cases 

Table 12.4. Comparison of generalization capability for P = 6. (The last four
rows are from [7].)

Method                     Training cases   Non-dimensional error index
ANFIS                      500              0.007
AR model                   500              0.19
Cascaded-correlation NN    500              0.06
Backpropagation MLP        500              0.02
6th-order polynomial       500              0.04
Linear predictive method   2000             0.55


Table 12.5. Comparison of generalization capability for P = 84 (the first six
rows) and 85 (the last four rows). Results for the first six methods are generated
by iterating the solution at P = 6. Results for localized receptive fields (LRFs) and
multiresolution hierarchies (MRHs) are for networks trained for P = 85. The last
eight rows are from ref. [7].

Method                     Training cases   Non-dimensional error index
ANFIS                                       0.036
AR model                                    0.39
Cascaded-correlation NN                     0.32
Backpropagation MLP                         0.05
6th-order polynomial                        0.85
Linear predictive method   2000             0.60
LRF                                         0.10-0.25
LRF                                         0.025-0.05
MRH                                         0.05
MRH                                         0.02


except where a substantially larger amount of training data was used (e.g., the last
row of Table 12.5). Figure 12.23 depicts the generalization test results for ANFIS
when P = 84.
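
Iterating a six-step-ahead predictor to reach P = 84 takes 14 passes, each feeding the newest prediction back in as the most recent input. The sketch below illustrates this with hypothetical names; `predict` stands for any trained model that maps [x(t − 18), x(t − 12), x(t − 6), x(t)] to an estimate of x(t + 6).

```python
def iterate_prediction(predict, window, n_iterations=14):
    """Estimate x(t + 6 * n_iterations) from window = [x(t-18), x(t-12), x(t-6), x(t)]."""
    window = list(window)
    for _ in range(n_iterations):
        next_value = predict(window)
        # Drop the oldest sample and append the prediction as the newest one.
        window = window[1:] + [next_value]
    return window[-1]
```

Because every pass consumes earlier predictions instead of measured values, errors can accumulate, which is one reason the P = 84 test is the more demanding one.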


12.7 EXTENSIONS AND ADVANCED TOPICS 

Because of the extreme flexibility of adaptive networks, ANFIS can be generalized 
in a number of different ways. For instance, the membership functions can be 
changed to any of the parameterized MFs described in Section 2.4 of Chapter 2. 

Figure 12.23. Generalization test of ANFIS for P = 84.


Furthermore, we can replace the Π nodes in layer 2 with the parameterized T-norm
(see Section 2.5.3 of Chapter 2) and let the learning rule decide the best
T-norm operator for a specific application. Moreover, rules with
OR’ed antecedents, linguistic hedges (Section 3.3.1), and multiple outputs can be
realized in ANFIS accordingly. An extended ANFIS model, CANFIS, is discussed in
Chapter 13.

Another important issue in the training of ANFIS is how to preserve some intuitive
features that make the resulting fuzzy rules easy to interpret. These features
include ε-completeness [17, 18], moderate fuzziness, and reasonably shaped membership
functions. Although we did not pursue these directions in our discussion,
most of these features can be preserved by maintaining certain constraints or by
modifying the error measure, as explained next.

• The requirement of ε-completeness ensures that for any given value of an
input variable, there is at least one MF with grade greater than or equal to ε.
This guarantees that the whole input space is covered properly if ε is greater
than zero. The ε-completeness can be maintained by the constrained gradient
descent [48]. For instance, suppose that ε = 0.5 and the adjacent membership
functions are of the generalized bell MF in Equation (12.2) with parameter sets
{a_i, b_i, c_i} and {a_{i+1}, b_{i+1}, c_{i+1}}. Then ε-completeness is satisfied if
c_i + a_i ≥ c_{i+1} − a_{i+1}, and the satisfaction of this constraint is guaranteed throughout
the training if the constrained gradient descent is employed.

• Moderate fuzziness refers to the requirement that within most regions of
the input space, there should be a dominant fuzzy if-then rule with a firing
strength close to unity that accounts for the final output, instead of multiple
rules with similar firing strengths. This prevents neighboring MFs from having
too much overlap and makes the rule set more informative. In particular,
this eliminates one of the most unpleasant situations, in which one MF goes
entirely under another. A simple way to keep moderate fuzziness is to use a modified
error measure

    E' = E + \beta \sum_{i=1}^{P} \left[ -\bar{w}_i \ln(\bar{w}_i) \right], \qquad (12.22)

where E is the original squared error; β is a weighting constant; P is the size of
the training data set; and \bar{w}_i is the normalized firing strength of the ith rule
[see Equation (12.4)]. The second term, \sum_{i=1}^{P} [-\bar{w}_i \ln(\bar{w}_i)], in the preceding
equation is Shannon’s information entropy [5], and its value is minimized
whenever there is a \bar{w}_i equal to one. Since this modified error measure is not
based on data fitting alone, the ANFIS thus trained also has a potentially
better generalization capability. The improvement of generalization by using
an error measure based on both data fitting and weight elimination has been
reported in the neural network literature [42, 43].

• The easiest way to maintain a reasonable shape for each MF is to parameterize
the MF correctly to reflect adequate constraints. For one thing, we want all
the MFs to remain bell-shaped regardless of their parameter values. This is
not true for the generalized bell MF in Equation (12.2) if b_i < 0; a quick fix
is to replace b_i with b_i^2 + k, where k is a positive fixed constant.
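
The three devices in the list above can be illustrated with a short sketch. All function names are hypothetical. The generalized bell MF is taken in its usual form 1 / (1 + |(x − c)/a|^(2b)), and the way the entropy term is accumulated over training patterns in `modified_error` is one possible reading of the modified error measure, not a statement of the authors' implementation.

```python
import math

def gbell(x, a, b, c):
    """Generalized bell MF with width a, slope b, and center c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2.0 * b))

def gbell_safe(x, a, b_raw, c, k=0.5):
    """Bell MF whose exponent parameter b_raw**2 + k is positive by construction."""
    return gbell(x, a, b_raw ** 2 + k, c)

def half_complete(mf_left, mf_right):
    """Check c_i + a_i >= c_{i+1} - a_{i+1} for adjacent MFs given as (a, b, c), a > 0."""
    a1, _, c1 = mf_left
    a2, _, c2 = mf_right
    return c1 + a1 >= c2 - a2

def entropy_penalty(normalized_strengths):
    """Shannon entropy of one pattern's normalized firing strengths."""
    return -sum(w * math.log(w) for w in normalized_strengths if w > 0.0)

def modified_error(squared_error, strengths_per_pattern, beta):
    """E + beta * (entropy summed over patterns); an assumed interpretation."""
    return squared_error + beta * sum(entropy_penalty(w) for w in strengths_per_pattern)
```

The entropy is zero when a single rule fires with normalized strength one and largest when all rules fire equally, so adding it to the error pushes training toward one dominant rule per region.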

Throughout this chapter, we have assumed that the structure of ANFIS is fixed
and that the parameter identification is solved through the hybrid learning
rule. However, to make the whole approach more complete, structure identification
[34, 35] (which is concerned with the selection of an appropriate input-space
partition style, the number of membership functions on each input, and so on) is
equally important to the successful application of ANFIS, especially for modeling
problems with a large number of inputs. Effective partitioning of the input space can
decrease the number of rules and thus increase the speed in both the learning and
application phases. Advances in neural network structure identification [4, 20] may
shed some light on this.

Fuzzy control is by far the most successful application of fuzzy set theory and 
fuzzy inference systems. The adaptive capability of ANFIS makes it almost directly 
applicable to adaptive control and learning control. In fact, ANFIS can replace 
almost any neural network in a control system and perform the same function. We 
shall give a detailed coverage of neuro-fuzzy control in Chapters 17 and 18. 

The active role of neural networks in signal processing [15, 47] also suggests
similar applications for ANFIS. The nonlinearity and structured knowledge representation
of ANFIS are its primary advantages over classical linear approaches
in adaptive filtering [6] and adaptive signal processing [46], such as identification,
inverse modeling, predictive coding, adaptive channel equalization, adaptive interference
(noise or echo) canceling, and so on.

By employing the adaptive network as a common framework, we can construct 
other adaptive fuzzy models that are tailored for applications such as data classifi- 
cation and feature extraction. These types of adaptive fuzzy models are covered in 
Chapter 20. 


EXERCISES 

1. How many premise and consequent parameters are there in the ANFIS architectures
shown in Figure 12.1(b) and Figure 12.4(a)? (Assume that the generalized
bell function is used for all the membership functions.)

2. Which node functions in Figure 12.1 need to be changed if the ANFIS architec- 
ture is to become an equivalent structure for a zero-order Sugeno fuzzy model? 
How many parameters does the resulting ANFIS have? 

3. Construct an ANFIS that is equivalent to a two-input two-rule Mamdani fuzzy
model with max-min composition and centroid defuzzification. Explain the 
function you use to approximate the centroid defuzzification. Specify how to 
convert this function into node functions in the resulting ANFIS. 

4. Construct an ANFIS that is equivalent to a two-input two-rule Mamdani fuzzy 
model with sum-product composition and centroid defuzzification. In particu- 
lar, the constructed ANFIS should take advantage of Theorem 4.1 and have a 
simpler structure than the one based on max-min composition in the previous 
exercise. 

5. Prove that the scaled Gaussian function in Equation (12.11) is invariant under 
the product operator. (In other words, prove that the product of two scaled 
Gaussian functions is still a scaled Gaussian function.) 

6. Let the MFs in Figure 12.6 be specified by four generalized bell functions, each
with three parameters {a_i, b_i, c_i}, i = 1 (leftmost) to 4 (rightmost). Describe
the relationships (constraints) between these parameters such that any two
neighboring MFs always intersect at a height that is (a) equal to 0.5; (b) equal
to 0.3; (c) between 0.3 and 0.5.

7. The training data of Example 1 are generated at the beginning of the MATLAB
file trn_2in.m. Use your favorite neural network program (or the MATLAB
program tanmlp.m) to learn this mapping; the neural network should have
approximately the same number of parameters as that in Example 1. Plot
your results and compare them with Figure 12.7. [Note that there are more
than 20 different public domain NN software programs that can be retrieved
by anonymous ftp. A list of ftp sites for these codes, including the quick-
propagation neural network, can be found in the monthly posting of FAQ (frequently
asked questions) on the Usenet newsgroup comp.ai.neural-nets. The
FAQ file neural-net-faq is also archived in the periodic posting archive on
host rtfm.mit.edu, under the anonymous ftp directory pub/usenet/news. For
people without ftp access, a mail server can be used as well to obtain this file.
For more information, send an e-mail message to mail-server@rtfm.mit.edu
with help and index in the body on separate lines.]

8. Repeat the previous exercise, but increase the size of the neural network until 
it can yield about the same performance as the ANFIS in Figure 12.7. How 
many parameters are there in your neural network? 

9. The training data of Example 2 are generated at the beginning of the MATLAB 
file trn_3in.m. Use your favorite neural network program to learn this map- 
ping; the neural network should have approximately the same number of pa- 
rameters as that in Example 2. Plot your results and compare them with 
Figure 12.12. 

10. The off-line training data of Example 3 are generated at the beginning of the 
MATLAB file trn_1in.m. Use your favorite neural network program to learn
this mapping. Plot your results and compare them with Figures 12.14, 12.15, 
and 12.16. Vary the number of hidden nodes to see how many hidden units are 
necessary to achieve a performance level similar to Figure 12.15. 

11. The Mackey-Glass time series, as specified by the time delay differential equation
(Equation (12.19)), can be obtained via a numerical method called the
fourth-order Runge-Kutta method. The file ts.c (available via FTP, see
page xxiii) is a simple C program that generates the Mackey-Glass time series
using this numerical method. Compile this program and run it. Use MATLAB
to plot the resulting time series when τ = 17 and 50.

12. Write a MATLAB script to transform the raw time series data (when τ = 17,
see the previous exercise) into the format of the training data (and the checking
data) shown in Equation (12.20).

13. Write MATLAB scripts to perform polynomial fitting to the training data ob- 
tained in the previous exercise. Specifically, try (a) first-order polynomial fit- 
ting; (b) second-order polynomial fitting; and (c) third-order polynomial fitting. 
Verify your model by using both training and checking data, and compare the 
results with Figure 12.18. 

14. Using the polynomial models obtained in the previous exercise, perform itera- 
tive prediction until P = 84. Compare the results with Figure 12.23. 

15. Redo the simulation in Example 4, but swap the training and checking data.
Plot the prediction errors and the MFs before and after training.

16. Redo the simulation in Example 4, but set the initial step to 0.1, 0.2, . . . , 1.0, 
respectively. Plot 10 training error curves corresponding to these initial step 
sizes. Do these training error curves vary a lot? 


Coactive Neuro-Fuzzy Modeling: Toward Generalized ANFIS


E. Mizutani 


13.1 INTRODUCTION 

As discussed in the preceding chapter, ANFIS enjoys its own hybrid learning strategy,
automatically tuning a Sugeno-type inferencing system and generating a single
output of a weighted linear combination of the consequents. This chapter highlights
extensions of such ANFIS concepts. Specifically, we discuss multiple-output
ANFIS with nonlinear fuzzy rules. ANFIS originally made its debut as an adaptive
system with much emphasis on its advantage of being a linguistically
interpretable fuzzy inference system (FIS) that allows prior knowledge to be embedded
in its construction and allows the possibility of understanding the results of
learning.

The extensions emphasize characteristics of a more fused neuro-fuzzy system
that enjoys many of the advantages claimed for neural networks (NNs) and the
linguistic interpretability of an FIS. As a result, we call this generalized ANFIS
“CANFIS,” which stands for coactive neuro-fuzzy inference systems, wherein
both NNs and FIS play active roles in an effort to reach a specific goal [27]. In this
sense, CANFIS spans various degrees of the neuro-fuzzy spectrum between the
two extremes: a completely understandable FIS and a black-box NN, which is at the
other end of the interpretability spectrum. Neuro-fuzzy models can be characterized
by the neuro-fuzzy spectrum, in light of linguistic transparency and input-output
mapping precision.

In this chapter, using trivial examples, we shall clarify the nature of CANFIS 
from an NN perspective and survey the modeling methodologies of a generalized 
ANFIS. In Chapter 22, we shall describe the application of CANFIS to a more 
complicated problem. 

Figure 13.1. Two-output CANFIS architecture with two rules per output (left),
and a comparable MANFIS structure (right).

13.2 FRAMEWORK 

We delineate the basic outline of CANFIS in connection with a radial basis function 
network (RBFN) and a well-known backpropagation multilayer perceptron (MLP) 
from an architectural standpoint. (For functional equivalence between FIS and 
RBFN, refer to Section 9.5.2.) 

13.2.1 Toward Multiple Inputs/Outputs Systems 

CANFIS has extended the notion of a single-output system, ANFIS, to produce 
multiple outputs. One way to get multiple outputs is to place as many ANFIS 
models side by side as there are required outputs. In this MANFIS (multiple ANFIS) 
model illustrated in Figure 13.1 (right), no modifiable parameters are shared by 
the juxtaposed ANFIS models. That is, each ANFIS has an independent set of
fuzzy rules, which makes it difficult to realize possible correlations between
outputs. An additional concern resides in the number of adjustable parameters,
which drastically increases as outputs increase. We describe the application of
MANFIS to a two-output inverse kinematics problem in Section 19.3. 

Another way of generating multiple outputs is to maintain the same antecedents
of fuzzy rules among multiple ANFIS models; Figure 13.1 (left) visualizes this CANFIS
concept. In short, fuzzy rules are constructed with shared membership values
to express correlations between outputs. In the following, we shall elaborate on
CANFIS ideas.
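
The idea of shared antecedents can be sketched as follows. This is an illustrative sketch only: the function names and the rule layout are hypothetical, Gaussian MFs and product firing strengths are assumed for brevity, and each rule carries one first-order (linear) consequent per output.

```python
import math

def gauss(x, center, sigma):
    """Gaussian membership function (an assumed MF type for this sketch)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def canfis_outputs(x, y, rules):
    """Evaluate all outputs from one shared set of firing strengths."""
    # Firing strengths are computed once and reused by every output.
    strengths = [gauss(x, *r["mf_x"]) * gauss(y, *r["mf_y"]) for r in rules]
    total = sum(strengths)
    n_outputs = len(rules[0]["consequents"])
    outputs = []
    for i in range(n_outputs):
        weighted = 0.0
        for w, rule in zip(strengths, rules):
            p, q, r0 = rule["consequents"][i]   # linear consequent p*x + q*y + r0
            weighted += w * (p * x + q * y + r0)
        outputs.append(weighted / total)
    return outputs

# Two rules, two outputs; MFs are given as (center, sigma).
rules = [
    {"mf_x": (0.0, 1.0), "mf_y": (0.0, 1.0),
     "consequents": [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]},
    {"mf_x": (1.0, 1.0), "mf_y": (1.0, 1.0),
     "consequents": [(0.0, 1.0, 1.0), (0.0, 0.0, 3.0)]},
]
```

In a MANFIS arrangement each output would instead carry its own copy of `mf_x` and `mf_y`, so the premise parameters would multiply with the number of outputs.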

13.2.2 Architectural Comparisons 

In both CANFIS and RBFN, locality is considered by Euclidean norms between each 
local center and the input vector, as we will see in Figure 13.3. By comparison, the 
inner product of each weight vector and input vector is taken in a backpropagation 
MLP to measure similarity between training patterns. In this section, we compare 
a simple backpropagation MLP with locally tuning models: CANFIS and RBFN. 

Figure 13.2. (a) A two-input, one-output Sugeno (TSK) fuzzy model; (b),(c),(d)
equivalent ANFIS/CANFIS architectures; (e) a comparable simple backpropagation
MLP.

A single-output CANFIS can be illustrated with the same schematic diagrams as
ANFIS in Figures 13.2(b) through 13.2(d). When all three neurons (1, 2, 3) have
identity functions in Figure 13.2(d), the presented CANFIS is equivalent to the
Sugeno (TSK) fuzzy inference system [43] in Figure 13.2(a), which accomplishes
fuzzy if-then rules (linear rules) such as the following:

Rule: If x is A_1 and y is B_1, then C_1 = p_1 x + q_1 y + r_1.

(For more details, see Section 12.2.)

Moreover, in Figure 13.2(d), the consequent parts, C_1 and C_2, are expressed
in a network-layered representation to provide a clearer comparison with a simple
backpropagation MLP in Figure 13.2(e). This figure also suggests an easy implementation
of such a neuro-fuzzy model by modifying an available bare-bones
backpropagation MLP program. Let us contrast the neuro-fuzzy model with the well-
discussed black-box MLP, whose weights are just numeric connection strengths that
carry no linguistic meaning for us; the hidden layer of the MLP is tantamount to the
consequent layer of CANFIS. Putting more hidden nodes in the MLP is equivalent to
adding more rules to CANFIS. The MLP’s weights between the output layer and
the hidden layer correspond to membership values between the consequent layer
and the fuzzy association layer in CANFIS. This comparison emphasizes the internal
transparency of CANFIS.

From an architectural point of view, CANFIS’s powerful capability stems from 
pattern-dependent weights between the consequent layer and the fuzzy association 
layer. Membership values correspond to those dynamically changeable weights that 
depend on input patterns. That is, CANFIS is locally tuned, like the RBFN dis- 
cussed in Section 9.5. In contrast, the backpropagation MLP with sigmoidal neuron 
functions globally updates weight coefficients for every input pattern, attempting 
to find one specific set of weights common to all training patterns. In other words, 
those weights are used in a global fashion. Hence, the RBFN may need more data
to achieve a certain accuracy than the MLP, although the RBFN may learn faster
than the MLP.

Furthermore, it is said that the backpropagation MLP can be a better extrap- 
olator than RBFN due to its global nature, and that RBFN fails to estimate the 
values of functions outside the range of the training data because of the local nature 
of its hidden receptive fields [6]. This claim may not always be valid; an RBFN with 
normalization may be able to sense beyond the training data set. Thus, questions 
about extrapolation ability still remain. Locally tuning NNs with normalization 
may lead to extrapolation results comparable to the backpropagation MLP (refer 
to the discussion of normalization in Section 13.5.5). In addition, when an RBFN
has hidden weights C_i (i.e., weights between hidden and output layers) that are
expressed in the form of linear functions, C_i = p_i x + q_i y + r_i, such an RBFN can
be regarded as a sort of compromise between local and global methods because
linear functions have global natures just like sigmoidal functions. Both ANFIS and
CANFIS are based on this strength as fusion models, deriving from the first-order 
Sugeno (TSK) FIS. 
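
The effect of normalization and linear hidden weights on extrapolation can be seen in a small one-dimensional sketch. The function name is hypothetical, and Gaussian basis functions with a common width are assumed; this is an illustration of the argument above, not a model taken from the text.

```python
import math

def rbfn_output(x, centers, sigma, weights, normalize):
    """One-input RBFN whose hidden weight i is the linear function p_i * x + r_i."""
    acts = [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]
    out = sum(a * (p * x + r) for a, (p, r) in zip(acts, weights))
    # With normalization, the activations act as weights that sum to one.
    return out / sum(acts) if normalize else out

centers = [0.0, 1.0]
weights = [(1.0, 0.0), (2.0, -1.0)]   # (p_i, r_i) for each basis function
far_x = 10.0                          # well outside the range of the centers

with_norm = rbfn_output(far_x, centers, 0.5, weights, normalize=True)
without_norm = rbfn_output(far_x, centers, 0.5, weights, normalize=False)
```

Far from both centers the unnormalized output collapses toward zero, whereas the normalized output follows the linear function of the nearest basis function (here 2x − 1), which is the sense in which such a network can respond beyond the training range.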


13.3 NEURON FUNCTIONS FOR ADAPTIVE NETWORKS 

We shed light on a diversity of neuron functions for adaptive networks in this section. 
We also provide fundamental design aspects for rule formation in CANFIS in light 
of RBFN and modular network features. 

In pursuit of a truly adaptive network, there should be no constraints on neuron
functions. Various functions are used as basis functions in RBFNs as alternatives
to Gaussian functions. Since neuron functions in the RBFN hidden layer correspond
to MFs in CANFIS, we direct our first step toward exploring the NN literature on
hidden neuron functions.

Figure 13.3. Anatomy of CANFIS and RBFN (top pair), and corresponding contours
of their receptive fields (middle pair). Two-output CANFIS with four rules
per output (upper left) and a comparable RBFN with normalization (upper
right), which exactly corresponds to Figure 9.10(d). Here both CANFIS and RBFN
are in adaptive network representation. For comparison purposes, the two bottom
figures illustrate receptive field contours constructed by CANFIS with six MFs (or
nine rules) (bottom left), and RBFNs with six basis functions (bottom right).


13.3.1 Fuzzy Membership Functions versus Receptive Field Units 

Figure 13.3 provides an anatomical view of CANFIS
in parallel with a corresponding RBFN with normalization, which divides the
output of each neuron in the output layer by the sum of all basis functions’ outputs.
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Figure 13.4. Contours of outputs of generalized bell-shaped MFs (left) and trian- 
gular MFs (right) with two inputs using min and product operations. 


This procedure corresponds to the following output function: 

4 4 

°i = ^2 WjC 2 i+j/ w 3 (* = 2 )- 

j=l j= 1 

Also illustrated in this figure are the simplified contours of receptive fields on the 
input $x$-$y$ plane. Note that in this RBFN figure, we illustrate one of the most 
advanced RBFNs, where the hidden weights $C_i$ are expressed in the form of linear 
functions rather than just real numbers. (For detailed discussions on RBFNs, see 
Section 9.5 in Chapter 9.) The figures on the $x$-$y$ plane imply that CANFIS 
can construct hyperellipses in higher dimensions through product operations, while 
RBFNs with Gaussian basis functions can form hyperspheres around their centers 
because inputs are plugged into the same basis functions as depicted in Figure 13.3 
(upper right), where inputs, X and Y, both go to the same basis function, $R_i$. In the 
RBFN framework, hyperellipses can be built up by using other basis functions [45], 
such as a generalized Gaussian radial-basis function [23, 33] based on the concept of 
a weighted norm, and an oriented non-radial-basis function [35]; both functions have 
modifiable orientation parameters. Notice that in CANFIS, nine elliptical contours 
are formed on the $x$-$y$ input plane when three MFs are introduced per input, 
whereas six round contours are formed when six basis functions are introduced in 
an RBFN, as illustrated in Figure 13.3 (bottom pair). In other words, CANFIS 
basically realizes grid partitions while the RBFN realizes scatter partitions (see 
Section 4.5.1). 
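
As a minimal sketch of the normalized output function discussed above, the following Python fragment evaluates one output of an RBFN whose hidden weights are linear functions of the inputs. The function name, the use of NumPy, and the single shared width `sigma` are assumptions made for illustration only; they are not taken from the text.

```python
import numpy as np

def normalized_rbfn(x, centers, sigma, coeffs):
    """One output of a normalized RBFN with linear hidden weights (illustrative).

    x:       input vector, shape (d,)
    centers: basis-function centers, shape (n, d)
    sigma:   common width of the Gaussian basis functions (assumed scalar)
    coeffs:  rows [p_j, q_j, ..., r_j] of the linear hidden weights, shape (n, d + 1)
    """
    # Radial Gaussian receptive fields: w_j = exp(-||x - u_j||^2 / (2 sigma^2))
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Linear hidden weights: C_j = p_j * x + q_j * y + r_j
    c = coeffs[:, :-1] @ x + coeffs[:, -1]
    # Normalization: divide the weighted sum by the sum of all basis outputs
    return np.sum(w * c) / np.sum(w)
```

Because the normalized weights sum to one, the output is a convex combination of the linear functions $C_j$; if every basis function carries the same linear function, the network reproduces that function exactly.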

To present a clearer illustration of the contours, Figure 13.4 shows the output 
contours resulting from two bell-shaped MFs and triangular MFs using two common 
operations, “min” and “product.” 

Figure 13.5. Outputs of generalized bell-shaped MFs obtained using four representative 
logical operations: min, max, product, and summation. 

An NN with semi-local activation hidden units (or Gaussian-bar hidden units), 
which can attain convergence performance comparable to RBFNs, has been 
reported [5]. The output response of such an ith unit is given by 


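$$(w_i =)\; B_i(\lVert \mathbf{x} - \mathbf{u}_i \rVert) = \sum_j p_{ij} \exp\left[ -\frac{(x_j - u_{ij})^2}{2\sigma_i^2} \right], \tag{13.1}$$

where $p_{ij}$ is a positive parameter. By comparison with this summation unit, the 
Gaussian hidden unit in an RBFN can be regarded as a product unit: 

$$(w_i =)\; B_i(\lVert \mathbf{x} - \mathbf{u}_i \rVert) = \prod_j \exp\left[ -\frac{(x_j - u_{ij})^2}{2\sigma_i^2} \right]. \tag{13.2}$$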


Hence, product operations are indicated conspicuously in the RBFN illustration 
in Figure 13.3 (upper right). In the fuzzy logic literature, those operations are 
treated in a systematic way, as detailed in Section 2.5.2. Figure 13.5 shows resulting 
bell-shaped MF response characteristics, obtained from four representative logical 
operations: min, max, product, summation. Through the summation operation, we 
can realize response characteristics similar to those of the semilocal activation unit, 
defined in Equation (13.1). 
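
To make the contrast between the summation unit of Equation (13.1) and the product unit of Equation (13.2) concrete, here is a small sketch in Python. The function names and the single scalar width `sigma` are assumptions for illustration.

```python
import numpy as np

def gaussian_bar(x, u, sigma, p):
    """Semilocal (Gaussian-bar) unit: weighted SUM of one-dimensional Gaussians."""
    return np.sum(p * np.exp(-(x - u) ** 2 / (2.0 * sigma ** 2)))

def gaussian_product(x, u, sigma):
    """Gaussian RBFN unit written as a PRODUCT of one-dimensional Gaussians."""
    return np.prod(np.exp(-(x - u) ** 2 / (2.0 * sigma ** 2)))

# A point that matches the center along the first axis only.
x = np.array([0.0, 5.0])
u = np.zeros(2)
bar_response = gaussian_bar(x, u, 1.0, np.ones(2))   # stays close to 1
product_response = gaussian_product(x, u, 1.0)       # nearly 0
```

The summation unit still responds when only one input coordinate is near the center (a "bar" along each axis), whereas the product unit responds only when all coordinates are near the center.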

Lane [21] discusses an NN with spline receptive field functions; the spline 
functions are also common as fuzzy MFs [36, 44]. Popular MFs are listed in Section 2.4; 
here we present a few MFs for the purpose of further discussion: 

$$\mu_{\mathrm{bell}}(x) = \frac{1}{1 + \left| \dfrac{x - c}{a} \right|^{2b}}, \tag{13.3}$$

$$\mu_{\mathrm{mdbell}}(x) = \max\{\, \cdots,\; 0 \,\}, \tag{13.4}$$

where $\{a, b, c\}$ is an adjustable parameter set. The latter definition is a modified 
bell-shaped MF, which has a limited base width (support). 

Asymmetric functions are also common in FIS, such as the following two-sided 
Gaussian MF: 

$$\mu_{\mathrm{tsg}}(x) = \begin{cases} g_1(x) & \text{if } x < c_1, \\ g_2(x) & \text{if } x > c_2, \\ 1 & \text{otherwise}, \end{cases} \tag{13.5}$$

where $g_i(x)$ is the Gaussian function 

$$g_i(x) = \exp\left[ -\frac{(x - c_i)^2}{2a_i^2} \right],$$

and $\{a_i, c_i\}$ are modifiable parameters. 
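
The generalized bell MF of Equation (13.3) and the two-sided Gaussian MF of Equation (13.5) can be sketched as follows. The function and argument names are illustrative assumptions, and the two-sided version assumes $c_1 \le c_2$.

```python
import numpy as np

def bell_mf(x, a, b, c):
    """Generalized bell MF: 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

def two_sided_gaussian_mf(x, a1, c1, a2, c2):
    """Two-sided Gaussian MF; assumes c1 <= c2 so that the plateau is well defined."""
    x = np.asarray(x, dtype=float)
    left = np.exp(-(x - c1) ** 2 / (2.0 * a1 ** 2))    # g_1, used where x < c1
    right = np.exp(-(x - c2) ** 2 / (2.0 * a2 ** 2))   # g_2, used where x > c2
    return np.where(x < c1, left, np.where(x > c2, right, 1.0))
```

Here `a` controls the width of the bell (the MF equals 0.5 at `c ± a`), `b` controls the steepness of its shoulders, and the two widths `a1` and `a2` let the left and right flanks of the two-sided Gaussian differ.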

The more sophisticated MFs tend to have more modifiable parameters. Local 
tuning of parameters within such MFs may be necessary; one of the simplest means 
of local tuning is to fix center parameters and update shape parameters in the train- 
ing phase. Another local tuning method uses different learning rates for adjusting 
those parameters [4]. 


13.3.2 Nonlinear Rule 

In this section, we focus on the neuron functions at the consequent layer; 
they form the consequent parts $C_1$ and $C_2$ in Figure 13.2(d). (Note that 
in ANFIS, these functions are typically identity functions.) When we replace them 
with nonlinear functions, we have nonlinear consequents. Accordingly, the neuron 
functions in the consequent layer play an important role in rule formation. 

Suppose we have a sigmoidal function as a neuron function in the consequent 
layer. Then we have a nonlinear consequent, $C_{\mathrm{non}}$: 

$$C_{\mathrm{non}} = \frac{1}{1 + \exp[-(p_1 x + q_1 y + r_1)]}. \tag{13.6}$$

In this case, we have a sigmoidal rule (see Figure 13.6, upper pair). 
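
A sigmoidal consequent differs from a linear one only in the neuron function applied to the same weighted sum. The sketch below, with illustrative function names that are not from the text, shows both forms together with the weighted average that combines the rules' outputs.

```python
import math

def linear_consequent(x, y, p, q, r):
    """First-order (linear) consequent: identity neuron function."""
    return p * x + q * y + r

def sigmoidal_consequent(x, y, p, q, r):
    """Nonlinear consequent of Equation (13.6): logistic function of p*x + q*y + r."""
    return 1.0 / (1.0 + math.exp(-(p * x + q * y + r)))

def combine_rules(firing_strengths, consequents):
    """Weighted average of rule outputs, weighted by the rules' firing strengths."""
    total = sum(firing_strengths)
    return sum(w * c for w, c in zip(firing_strengths, consequents)) / total
```

With sigmoidal consequents each rule output is confined to (0, 1), so the combined output of the weighted average is confined to the same interval.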

Furthermore, when each rule’s consequent is realized by an NN, we have neural 
rules, as illustrated in Figure 13.6 (lower left). Note that when the four consequent 
NNs have sigmoidal output neuron functions with no hidden layers, the four neural 
rules are reduced to sigmoidal rules. That is, the upper two diagrams are identical 
in Figure 13.6. 

Although the inside of each neural consequent turns out to be a black box, 
the whole CANFIS model retains the concept of fuzzy reasoning in terms of the 
behavior of the whole system; it still enjoys transparency with respect to MF 
interpretability, which is discussed in Section 13.4. One possible way of obtaining more 
precision in a given mapping is to construct sophisticated neural consequents 
without increasing fuzzy rules, although commonly more MFs or more fuzzy rules 
are introduced with little attention to interpretability. 

Figure 13.7. Anatomical graph of the mth rule associated with the jth output 
neuron, $F_j$, in CANFIS. 

When such neural consequents are further entwined (that is, when two neural 
consequents, “Neural Rule 1” and “Neural Rule 3,” are combined to form one neural 
rule, i.e., Local Expert NN 1, and “Neural Rule 2” and “Neural Rule 4” are fused into 
another neural rule, i.e., Local Expert NN 2), we have a construction similar to that 
of a typical modular connectionist architecture, as illustrated in Figure 13.6 (lower 
right), where the outputs of two local expert NNs are mediated by an integrating 
unit (or typically a gating network) [11, 12, 17, 18, 19, 29, 30]. This formation 
helps to reduce the number of modifiable parameters. In this context, CANFIS 
with neural rules can be equivalent to modular networks. The central idea resides 
in task decomposition, as discussed in Section 9.6. When the model size is still 
large, the modified bell MF [defined in Equation (13.4)] is more instrumental in 
controlling the number of firing rules than the original bell-shaped MF in that 
parameter-updating procedures for irrelevant or inactive rules can be skipped when 
iterative training procedures are employed. (We shall show a concrete example in 
Chapter 22.) An integrating unit in the modular network corresponds to a fuzzy 
membership value generator in CANFIS. 

Takagi et al. described this model from a fuzzy logic point of view [41]. Many 
other variations of modular networks have been discussed [42, 31, 22, 10, 20]; their 
design procedures, detailed in ref. [42], are more or less based on an independent 
training scheme whereby antecedent parts (or gating networks) and consequent 
parts are trained individually and then put together. 

Another training approach is to train both antecedent and consequent parts 
concurrently [7, 11, 13, 26]. When we apply simple steepest descent (backpropagation) 
algorithms alone to CANFIS, the procedure to minimize a sum of squared errors, 
$E$, is straightforward. Let $O_j$ and $F_j$ be the jth CANFIS output and the jth neuron 
function at the final output layer, respectively, as depicted in Figure 13.7. We then 
have 

$$O_j = F_j(\mathrm{NET}_j),$$

where $\mathrm{NET}_j$ denotes net input. The procedure for updating the mth rule’s 
consequent, which has a weight coefficient signified by $w_{mki}$, is as follows: 


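$$\begin{aligned}
\Delta w_{mki} &= -\eta_{wm} \frac{\partial E}{\partial w_{mki}} \\
&= -\eta_{wm} \frac{\partial E}{\partial \mathrm{NET}_j} \frac{\partial \mathrm{NET}_j}{\partial w_{mki}} \\
&= -\eta_{wm} \frac{\partial E}{\partial O_j} \frac{\partial O_j}{\partial \mathrm{NET}_j} \frac{\partial \mathrm{NET}_j}{\partial \mathrm{Act}_{mk}} \frac{\partial \mathrm{Act}_{mk}}{\partial \mathrm{net}_{mk}} \frac{\partial \mathrm{net}_{mk}}{\partial w_{mki}} \\
&= -\eta_{wm} \frac{\partial E}{\partial O_j} F_j'(\mathrm{NET}_j) \frac{W_m}{\sum_m W_m} f_{mk}'(\mathrm{net}_{mk}) \frac{\partial \sum_l w_{mkl}\, \mathrm{act}_{ml}}{\partial w_{mki}} \\
&= \eta_{wm} (t_j - O_j) F_j'(\mathrm{NET}_j) \frac{W_m}{\sum_m W_m} f_{mk}'(\mathrm{net}_{mk})\, \mathrm{act}_{mi},
\end{aligned}$$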

where $\eta_{wm}$ is a learning rate, $t_j$ is the jth desired output, $W_m$ is the 
membership value assigned to the mth rule, and $\mathrm{NET}_j$ is given by the following three 
equations using $W_m$ and the kth output of the mth rule, $\mathrm{Act}_{mk}$: 

$$\begin{aligned}
\mathrm{NET}_j &= \frac{\sum_m W_m \mathrm{Act}_{mk}}{\sum_m W_m} = \frac{W_1 \mathrm{Act}_{1k} + \cdots + W_m \mathrm{Act}_{mk} + \cdots}{W_1 + W_2 + \cdots + W_m + \cdots}, \\
\mathrm{Act}_{mk} &= f_{mk}(\mathrm{net}_{mk}), \\
\mathrm{net}_{mk} &= \sum_l w_{mkl}\, \mathrm{act}_{ml},
\end{aligned}$$

where $f_{mk}$ is the kth output function in the mth rule, and $\mathrm{net}_{mk}$ is the total input. 
Accordingly, the procedure for updating the antecedent part, which has a parameter 
denoted by $a$, can be given by 


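$$\begin{aligned}
\Delta a &= -\eta_a \frac{\partial E}{\partial a} \\
&= -\eta_a \frac{\partial E}{\partial O_j} \frac{\partial O_j}{\partial a} \\
&= -\eta_a \frac{\partial E}{\partial O_j} \frac{\partial O_j}{\partial \mathrm{NET}_j} \frac{\partial \mathrm{NET}_j}{\partial W_m} \frac{\partial W_m}{\partial a} \\
&= -\eta_a \frac{\partial E}{\partial O_j} F_j'(\mathrm{NET}_j) \frac{\partial}{\partial W_m} \left( \frac{\sum_m W_m \mathrm{Act}_{mk}}{\sum_m W_m} \right) \frac{\partial W_m}{\partial a} \\
&= -\eta_a \{ -(t_j - O_j) \} F_j'(\mathrm{NET}_j) \frac{\mathrm{Act}_{mk} - \mathrm{NET}_j}{\sum_m W_m} \frac{\partial W_m}{\partial a},
\end{aligned}$$

where $\eta_a$ is a learning rate. 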







Figure 13.8. A truncation filter function (top), and a modified sigmoidal 
function (middle) compared with its derivative (bottom) when MAX is 0.9 and MIN is 
0.1. By comparison, dashed curves show a normal sigmoidal logistic function. 
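
The consequent and antecedent update rules derived earlier in this section can be checked numerically. The sketch below assumes a deliberately small model that is not taken from the text: one input, one output, generalized bell MFs, sigmoidal rules, a logistic output function $F$, and $E = \frac{1}{2}(t - O)^2$. The antecedent parameter chosen for illustration is the MF center $c_m$, whose derivative for the bell MF is $\partial W_m / \partial c_m = 2b\, W_m (1 - W_m) / (x - c_m)$ for $x \neq c_m$. All function names are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bell(x, a, b, c):
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

def forward(x, mf, w):
    """mf: (M, 3) rows [a, b, c]; w: (M, 2) rows [slope, bias] of each rule."""
    W = bell(x, mf[:, 0], mf[:, 1], mf[:, 2])   # membership values W_m
    act_in = np.array([x, 1.0])                 # act_ml: inputs to each consequent
    net = w @ act_in                            # net_m = sum_l w_ml * act_ml
    Act = sigmoid(net)                          # Act_m = f_m(net_m), sigmoidal rules
    NET = np.sum(W * Act) / np.sum(W)           # weighted average of rule outputs
    O = sigmoid(NET)                            # O = F(NET)
    return W, act_in, Act, NET, O

def loss(x, t, mf, w):
    return 0.5 * (t - forward(x, mf, w)[4]) ** 2

def analytic_grads(x, t, mf, w):
    W, act_in, Act, NET, O = forward(x, mf, w)
    dE_dNET = -(t - O) * O * (1.0 - O)          # (dE/dO) * F'(NET)
    # Consequent: dE/dw_mi = dE/dNET * W_m / sum(W) * f'(net_m) * act_mi
    g_w = (dE_dNET * W / W.sum() * Act * (1.0 - Act))[:, None] * act_in[None, :]
    # Antecedent: dE/dc_m = dE/dNET * (Act_m - NET) / sum(W) * dW_m/dc_m
    dW_dc = 2.0 * mf[:, 1] * W * (1.0 - W) / (x - mf[:, 2])   # valid for x != c_m
    g_c = dE_dNET * (Act - NET) / W.sum() * dW_dc
    return g_w, g_c

def steepest_descent_step(x, t, mf, w, eta_w=0.1, eta_a=0.1):
    """One update of consequent weights and MF centers (other MF parameters fixed)."""
    g_w, g_c = analytic_grads(x, t, mf, w)
    mf_new = mf.copy()
    mf_new[:, 2] -= eta_a * g_c
    return mf_new, w - eta_w * g_w
```

Comparing `analytic_grads` with central finite differences of `loss` is a convenient way to confirm that an implementation of the chain rule matches the derivation before using it for training.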


CANFIS with neural rules stands on a sort of complementary mixture viewpoint, 
due to the weighted average of fuzzy membership functions’ outputs. In contrast, 
use of the softmax activation function [25, 2] [defined in Equation (9.26)] in modular 
networks provides a sort of competitive mixing perspective. Section 9.6 in Chapter 9 
provides a more detailed description of this perspective. 
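
The two mixing schemes can be placed side by side in a short sketch. It assumes that Equation (9.26), which is not reproduced in this section, is the usual softmax $\exp(u_i) / \sum_j \exp(u_j)$; the function names are illustrative.

```python
import numpy as np

def normalized_membership_weights(memberships):
    """CANFIS-style weights: each membership value divided by the sum of all."""
    m = np.asarray(memberships, dtype=float)
    return m / m.sum()

def softmax_weights(scores):
    """Softmax gating weights (assumed form of the gating function)."""
    s = np.asarray(scores, dtype=float)
    e = np.exp(s - s.max())   # subtracting the maximum avoids overflow
    return e / e.sum()
```

Both functions return nonnegative weights that sum to one; they differ in how the raw values are turned into weights, with the exponential in the softmax sharpening the differences between experts.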

13.3.3 Modified Sigmoidal and Truncation Filter Functions 

In this section, we discuss the neuron functions placed at the final output layer [i.e., 
the fuzzy association layer such as $f_1$ in Figures 13.2(d) and 13.6]. In particular, we 
introduce a modified sigmoidal function and a truncation filter function.¹ 
Recall that ANFIS typically possesses the identity function as $f_1$. 

¹ Here we use the term truncation filter functions rather than piecewise-linear functions, 
because the desired output range is purposely changed to match the output range of the 
truncation filter functions. 

We sometimes expect an NN to specify very small desired outputs. That is, 
the NN is required to learn extreme values close to the rim of the output range. 
When normal sigmoidal logistic functions are introduced at the output layer of an 
NN, it is known that the NN fails to learn such extreme values [34]. One way to 
improve NN performance is to replace the squared error function with an entropic 
function [1, 8, 37]; minimizing error can be regarded as maximizing entropy or log 
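likelihood. As an alternative approach, we have introduced a modified sigmoidal 
function, $f_{\mathrm{mod}}$, and a truncation filter function, $f_{\mathrm{trc}}$, as neuron functions for the 
output layer [26, 28]. They are easily implemented with small modifications, as 
follows: 

$$f_{\mathrm{trc}}(x) = \begin{cases} \mathrm{MIN} & \text{if } x < \mathrm{MIN}, \\ \mathrm{MAX} & \text{if } x > \mathrm{MAX}, \\ x & \text{otherwise}, \end{cases} \tag{13.7}$$

$$f_{\mathrm{mod}}(x) = \begin{cases} \mathrm{MIN} & \text{if } f(x) < \mathrm{MIN}, \\ \mathrm{MAX} & \text{if } f(x) > \mathrm{MAX}, \\ f(x) & \text{otherwise}, \end{cases} \tag{13.8}$$

where $f(x)$ is the normal sigmoidal logistic function 

$$f(x) = \frac{1}{1 + \exp(-x)}. \tag{13.9}$$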


This improvement keeps neuron outputs within the desired output range, [MIN, MAX]. 
We set MIN to 0.1 and MAX to 0.9, and outputs that are above MAX or below 
MIN are then forced to MAX or MIN. 

To take advantage of the modified sigmoidal function and the truncation filter 
function, we need to change the range of desired outputs (i.e., output scaling). 
Suppose the original range of a desired output ($do_i$) is between 0.0 and 1.0. It can 
be linearly changed [24] to [MIN, MAX] by $d_i = do_i\,(\mathrm{MAX} - \mathrm{MIN}) + \mathrm{MIN}$: 

$$0 \le do_i \le 1 \;\Rightarrow\; \mathrm{MIN} \le d_i \le \mathrm{MAX}.$$

Thus, these functions, $f_{\mathrm{trc}}(x)$ and $f_{\mathrm{mod}}(x)$, prevent an NN from exceeding the 
desired output range. We can effectively teach the NN the boundaries of the desired 
output range by scaling to the interval [MIN, MAX]. Suppose, for instance, that the 
ith desired output ($d_{ip}$) of training pattern $p$ is MIN. When the ith output ($o_{ip}$) is 
smaller than MIN, the error [either $d_{ip} - f_{\mathrm{trc}}(o_{ip})$ or $d_{ip} - f_{\mathrm{mod}}(o_{ip})$] is zero, due to 
the modified functions defined in Equations (13.7) and (13.8). In other words, that 
pattern $p$ is not used to update the weights associated with the ith output unit, but 
just skipped. 
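
The two output functions and the target scaling are short enough to sketch directly. The names below are illustrative, and MIN and MAX take the values 0.1 and 0.9 used in this section.

```python
import math

MIN, MAX = 0.1, 0.9

def logistic(x):
    """Normal sigmoidal logistic function, Equation (13.9)."""
    return 1.0 / (1.0 + math.exp(-x))

def f_trc(x):
    """Truncation filter function, Equation (13.7): clamp x to [MIN, MAX]."""
    return min(max(x, MIN), MAX)

def f_mod(x):
    """Modified sigmoidal function, Equation (13.8): clamp logistic(x) to [MIN, MAX]."""
    return min(max(logistic(x), MIN), MAX)

def scale_target(d_orig):
    """Map a desired output in [0, 1] linearly onto [MIN, MAX]."""
    return d_orig * (MAX - MIN) + MIN

# A pattern whose scaled target is MIN and whose raw output lies below MIN
# yields zero error, so it contributes nothing to the weight update.
error = scale_target(0.0) - f_trc(-0.3)
```

In this example `error` is exactly zero, which is the "skipped pattern" behavior described above.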

Another important alteration of these functions resides in the calculation of 
derivatives. When the neuron output is MIN or MAX, these functions produce 
a pseudoderivative; for example, a modified sigmoidal function produces a 
pseudoderivative whose size is MIN(1 − MIN) or MAX(1 − MAX), as shown in Figure 13.8. 
We can just stick to the same derivative form as that of the normal sigmoidal logistic 
function, given by 

$$f'(x) = f(x)(1 - f(x)),$$

which has an important computational advantage due to its simple form. The 
amount of weight change is proportional to the magnitude of the derivative. This 
modification produces a derivative that is always larger than a certain size, and 
therefore may have a positive impact on learning extreme values and accelerating 
learning. A similar modification can be found in a method whereby a small positive 
bias derivative is simply added to the original derivative, that is, $f'(x) + \text{bias}$ [6]. 
Fahlman [3] suggests use of a small magic constant, 0.1, as the bias derivative. 
(Notice that this derivative advantage is valid when the weight-change derivatives 
are not normalized. For derivative normalization, see Sections 6.7 and 8.3.) The 
aforementioned modifications can be applied to any sigmoidal function, such as the 
hyperbolic tangent function. 
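
The effect of the pseudoderivative is easy to see numerically. The sketch below, whose function names are illustrative, applies the form $f(x)(1 - f(x))$ to the clamped output and compares it with the ordinary logistic derivative for a saturated input.

```python
import math

MIN, MAX = 0.1, 0.9

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    y = logistic(x)
    return y * (1.0 - y)

def f_mod(x):
    return min(max(logistic(x), MIN), MAX)

def f_mod_pseudoderivative(x):
    """Same form y * (1 - y), evaluated on the clamped output y = f_mod(x)."""
    y = f_mod(x)
    return y * (1.0 - y)

saturated_plain = logistic_derivative(10.0)        # vanishingly small
saturated_modified = f_mod_pseudoderivative(10.0)  # stays at MAX * (1 - MAX)
```

With MIN = 0.1 and MAX = 0.9 the pseudoderivative never falls below about 0.09, whereas the ordinary derivative shrinks toward zero as the unit saturates.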

These functions, $f_{\mathrm{trc}}(x)$ and $f_{\mathrm{mod}}(x)$, are easily implemented and can 
help an NN to learn the boundaries of the output range and extreme values close to 
the rim of the output range. Application examples are presented in Section 13.5.5 
and in Chapter 22. Again, notice that we use these functions only in the output 
layer. 

13.4 NEURO-FUZZY SPECTRUM 

This section explains the concept of the neuro-fuzzy spectrum in terms of the 
trade-offs between input-output mapping precision and membership function (MF) 
interpretability from the fuzzy logic standpoint. 

Neuro-fuzzy models allow prior knowledge to be embedded via fuzzy rules with 
appropriate linguistic labels, and they offer the possibility of understanding the 
resultant models after learning. On the other hand, black-box neural networks, 
particularly backpropagation multilayer perceptrons, do not have the same level of 
ability to do knowledge embedding and extraction. 

This observation motivates the concept of the neuro-fuzzy spectrum, which is defined 
on the interpretability-precision plane depicted in Figure 13.9. Ideally, the learning 
of a neuro-fuzzy model should follow the vertical route to the top in such a way that 
the mapping precision is improved while interpretability is maintained. In 
practice, however, the learning process often follows the diagonal route of improving 
mapping precision and deteriorating interpretability at the same time. We refer to 
this situation as the dilemma between interpretability and precision. 

Adaptive neuro-fuzzy models like ANFIS/CANFIS transit smoothly between 
the two ends of the neuro-fuzzy spectrum: a completely understandable fuzzy inference 
system and a black-box neural network. 

When we apply advanced optimization techniques, the dilemma becomes more 
conspicuous. For example, given a fixed amount of computation time, the 
Levenberg-Marquardt (LM) method, discussed in Chapter 6, can achieve higher mapping 
precision than the steepest descent method. Yet the resultant MFs obtained by the LM 
method may vary significantly from their initial setups; consequently, we may lose 
the original interpretability designed for the initial membership functions [16]. In 
other words, the sophisticated methods may attain a higher input-output mapping 
precision, but they may consequently lead to meaningless fuzzy rules. 

Figure 13.9. Neuro-fuzzy spectrum. The plane’s axes denote neuro-fuzzy spectrum 
(horizontal route) and input-output mapping precision (vertical route). Ideally, the 
learning of a neuro-fuzzy model should follow the vertical route to the top, but it 
often takes the diagonal route of improving mapping precision at the expense of 
interpretability. 

Similar observations can be found when the number of rules is increased; the 
resulting MFs may not lend themselves to good linguistic interpretation. This issue 
is considered in Sections 13.5.2 and 22.4.2. The MFs should be carefully set up so 
that fuzzy rules can be held to meaningful limits (see also Section 22.4.1). 

If linguistic interpretability is not a concern, we are entitled to choose the most 
efficient learning algorithm. In this case, such a neuro-fuzzy model stands on the 
black-box NN endpoint on the neuro-fuzzy spectrum, attempting to achieve as high 
precision as possible. However, if linguistic interpretability is a concern, then we 
may need to update MF parameters carefully. Even if we use the simple steepest 
descent (backpropagation) algorithm, we cannot guarantee the resultant 
interpretability. This is demonstrated in Section 13.5.2. The hybrid learning rule 
generally gives interpretable results, as discussed in Section 13.5.3, but this is not 
always guaranteed. 

There are various approaches to alleviating the dilemma: 

• Apply another way of interpretation to ANFIS/CANFIS; many other 
interpretable NN models have structures similar to RBFN or modular networks. 
Actually, in a variety of studies of Bayesian networks and probabilistic 
networks [9, 32, 38, 39, 40], some networks had a configuration similar to RBFN. 
Probabilistic interpretation can also be put on modular networks [29, 18]. This 
implies that a given ANFIS/CANFIS structure can be interpreted from 
different viewpoints, regardless of whether it is called a fuzzy system. But such a 
different interpretability spectrum is beyond the scope of this book. 

• Change MF types, or adopt more sophisticated asymmetric MFs, such as 
two-sided bell MFs. (See Section 13.5.3.) Note that the interpretation 
of MFs’ shapes is still questionable. For instance, if the resultant MFs end 
up having complicated shapes, we may have no idea about how to interpret 
each of them. This is because we usually pay much attention to the global 
relationships between neighboring MFs, rather than the local peculiarity of a 
single MF. 

• Modify the learning algorithms so that they maintain MF interpretability. The 
hybrid learning algorithm has been contrived in this spirit (see Section 13.5.3). 

• Alter fuzzy rules’ structures by setting up nonlinear rules, as discussed in 
Section 13.3.2. In particular, CANFIS with neural rules may have less chance 
of losing MF interpretability than CANFIS with linear rules, because neural rules 
have more learning power than their linear counterparts. Such rules’ learning power 
may prevent MFs from varying a lot during the learning phase. Although MF 
interpretability can be maintained, rules’ consequent parts may become hard 
to understand due to their neural network structures (see Figure 13.15 in 
Section 13.5.4). 

• Put proper constraints on neighboring MFs so that resultant MF 
interpretability can be retained. The simplest way is to apply some knowledge to 
fixing the center positions of MFs. 

• Formulate a new error measure designed to increase interpretability, such as 
an error measure with a term similar to Shannon’s information entropy, as 
suggested in ref. [15]. 

• Transform the input space to another space, in which input values can be 
treated in a linguistically meaningful way, as discussed in Section 22.4.1. 

Figure 13.10. Results of the N-shape problem obtained by ANFIS (CANFIS with two 
linear rules) (a, b), and by a simple backpropagation MLP with three hidden units (c) 
and with seven hidden units (d). 

13.5 ANALYSIS OF ADAPTIVE LEARNING CAPABILITY 

We shall use the following two examples to clarify further CANFIS’s learning ca- 
pabilities. We also discuss the overall performance of the steepest descent method 
alone in contrast to the hybrid learning algorithm. These problem examples are 
trivial, but they provide us an insight into the power of CANFIS. (In Chapter 22, 
we shall discuss the application of CANFIS to a more difficult problem.) 

13.5.1 Convergence Based on the Steepest Descent Method Alone 

The first simulation example is simple: fitting an N-shaped letter that has two 
corners: a pointed top left-hand corner and a rounded right-hand corner. The 
target letter is shown as a dashed line in Figure 13.10(a). 

To see how adaptive capability depends on architecture itself, five CANFIS 
models (A)-(E) were trained on the basis of the same conventional steepest de- 
scent (backpropagation) algorithm with a fixed momentum (0.8) and a small fixed 
learning rate. (For detailed discussions on the steepest descent methods, refer to 
Chapter 6.) They are (A) CANFIS with linear rules (i.e., ANFIS), (B) CANFIS 
with sigmoidal rules, and (C),(D),(E) CANFIS with neural rules; each neural rule 
has two hidden units (cf. Figure 13.6, lower left). CANFIS (E) has a sigmoidal func- 
tion at the fuzzy association layer to generate final outputs, whereas CANFIS (C) 
and (D) have identity functions. The difference between CANFIS (C) and (D) is a 
neuron function in the output layer put within a neural consequent (e.g., functions 
3, 4, 5, and 6 in Figure 13.6, lower left). Specifically, each neural rule or local expert 
NN has a sigmoidal function to produce the rule’s output in CANFIS (C) or has an 
identity function in CANFIS (D). The results are shown in Table 13.1. By contrast, 
Table 13.2 shows the results from four simple backpropagation MLPs. 

Figure 13.10 shows a comparison of results from CANFIS and simple backprop- 
agation MLPs. CANFIS with two linear rules is able to capture the peculiarity of 
the N-shape, as shown in Figure 13.10(b), fitting well both the pointed and round 
corners of the N-shape. A systematic training procedure leads the bell MFs to the 
results presented in Figure 13.10(b). (They never cease to amaze us in that their 
neat figures can metamorphose into such unexpected shapes; manually tuned MF 
shapes may not match them.) 

Table 13.1. Root-mean-squared errors (×10⁻²) of five CANFIS models. Rules of 
CANFIS models (A, B, C, D) are shown in Figure 13.15. The column “Para. #” 
denotes “Parameter number.” 

CANFIS  Rule       Function at rules’  Function at fuzzy  Checking  Training  Para. 
model   formation  consequent layer    association layer  error     error     # 
(A)     Linear     Identity            Identity           1.38      1.35      10 
(B)     Sigmoidal  Sigmoid             Identity           8.22      7.45      10 
(C)     Neural     Sigmoid             Identity           2.40      2.08      20 
(D)     Neural     Identity            Identity           1.78      1.63      20 
(E)     Neural     Identity            Sigmoid            4.57      4.07      20 

Table 13.2. Root-mean-squared errors (×10⁻²) of four simple backpropagation 
MLPs (three single-hidden-layered MLPs and one two-hidden-layered MLP). 

# of hidden units   3      5      7      3x3 
Checking error      9.91   4.99   2.84   2.05 
Training error      8.64   4.75   2.73   1.84 
Parameter #         10     16     22     22 

Due to the discontinuity of the left-hand corner of the N-shape, MLPs with a 
small number of hidden neurons (three or four), possessing sigmoidal neuron func- 
tions, do not evolve to piecewise fit the pointed corner, as shown in Figure 13.10(c), 
in spite of having a comparable number of adjustable parameters (compare the pa- 
rameter numbers in Tables 13.1 and 13.2). They never reach the fitting level of 
Figure 13.10(b) within a preset iteration limit (200,000). 
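
The parameter numbers being compared can be reproduced with a short count. The sketch below rests on assumptions that are not stated explicitly in the text: a single-input, single-output problem, fully connected MLPs with one bias per neuron, two rules per CANFIS model, and one three-parameter bell MF per rule. The helper names are hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected MLP, e.g. [1, 3, 1]."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def canfis_param_count(n_rules, consequent_params, mf_params=3):
    """Each rule contributes its MF parameters plus its consequent parameters."""
    return n_rules * (mf_params + consequent_params)

# MLPs with 3, 5, 7 hidden units and with two hidden layers of 3 units each
mlp_counts = [mlp_param_count(s) for s in ([1, 3, 1], [1, 5, 1], [1, 7, 1], [1, 3, 3, 1])]

# Linear or sigmoidal rule: slope and bias (2 parameters per rule)
linear_rule_count = canfis_param_count(n_rules=2, consequent_params=2)

# Neural rule: a small network with two hidden units as the consequent
neural_rule_count = canfis_param_count(n_rules=2,
                                       consequent_params=mlp_param_count([1, 2, 1]))
```

Under these assumptions the counts come out as 10, 16, 22, and 22 for the MLPs, 10 for the linear and sigmoidal rules, and 20 for the neural rules, which agrees with the parameter numbers listed in Tables 13.1 and 13.2.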

These results reinforce the strength of CANFIS with the generalized bell MFs 
defined in Equation (13.3) in terms of convergence capacity. In the following three 
sections, we consider CANFIS’s learning ability from the standpoint of transparency. 

13.5.2 Interpretability Spectrum 

When three rules are introduced, it is observed that two MFs (MF2, MF3) transit 
back and forth at the beginning of their evolution as if struggling to find comfortable 
niches [see Figure 13.11(d)]. After that, their tracks merge into one. Figure 13.11(c), 
resulting from Figure 13.11(b), shows that they eventually took almost the same 
center position. Tuning MFs based on supervised learning algorithms sometimes 
leads us to meaningless fuzzy rules, as in this example. Because the bell MF used 
in this simulation has a symmetric neat shape, without having local/independent 
tuning of its own parameters, the MF may have no choice other than to do 
unnecessary movement and therefore may end up hiding itself inside another MF, as in 
Figure 13.11. 

Figure 13.11. (a),(b),(c) How to adapt to the N-shape when using CANFIS with 
three linear rules based on the steepest descent method alone. (d) Trajectories of the 
center positions of three MFs during the training phase. Although actual outputs 
(solid line) were almost perfectly fitted to the desired N-shape (dashed line) in (c), 
there is still a question regarding how to extract rules from a fuzzy logic standpoint. 

Now we encounter the limitation of understandable fuzzy rules. It is actually 
beyond our expectation that the two antecedents of Rule 2 and Rule 3 ended up 
being the same. The linguistic labels assigned to them, “medium” and “large,” 
turned out to be meaningless. This exemplifies the dilemma between precision 
and interpretability, discussed in Section 13.4. 

This may imply that the initial three fuzzy rules should be two fuzzy rules; 
either Rule 2 (MF2) or Rule 3 (MF3) may be redundant, which may be a clue 
about selection of appropriate fuzzy partitions. Indeed, CANFIS with two rules 
can accomplish the task well, but CANFIS with three rules fits the N-shape better, 
obtaining higher precision within a fixed amount of computation time. When we 
split the acquired three rules in Figure 13.11(c) into two combinations — “Rule 1 and 
Rule 2” and “Rule 1 and Rule 3” — neither output fits the N-shape; obviously, the 
three rules help each other in improving overall performance. Adding more fuzzy 
rules may result in lack of interpretability or ill-defined fuzzy rules. This issue will 
be discussed in another example in Section 22.4.2 of Chapter 22. 

13.5.3 Evolution of Antecedents (MFs) 

We notice that the resulting MF occupancies depend on a training method as well 
as an MF type. As in all results in Figures 13.12 and 13.13 obtained by the hybrid 
learning algorithm, there are no MFs that share the same center position. In this 
simulation, when original bell MFs were introduced, CANFIS based on the steepest 
descent method alone unexpectedly converged faster than CANFIS with the hybrid 
learning procedure. Moreover, CANFIS based on the hybrid learning procedure did 
not fit the N-shape very well, as Figures 13.12(a) and 13.12(b) show, while CANFIS 
with the steepest descent method alone recognized the features of the N-shape well, 
as shown in Figure 13.11(c). In many cases, the hybrid learning algorithm works 
better than the steepest descent method alone, so it is worth considering 
possible reasons for the opposite outcome here. The hybrid learning procedure strongly 
predominates where intuitively positioned MFs do not need to evolve very much. Initially, 
LSE (least-squares estimation) may specialize rules’ consequents to a great extent, which 
may prevent MFs from evolving. LSE can find certain rules’ consequent values that 
have minimal errors with the current MF setup, but after updating coefficients, it 
may end up losing its way to a better fitting level. 

Figure 13.12. Results of the N-shape problem obtained by CANFIS with three 
linear rules based on the hybrid learning algorithm using original bell MFs (left) 
and using asymmetric MFs (right). (a)(b) The rules’ outputs. (c)(d) The adapted 
MFs after the training phase. 

Figures 13.12(c) and 13.12(d) show results obtained using the asymmetric two- 
sided Gaussian MF defined in Equation (13.5) based on the hybrid learning al- 
gorithm; the resulting positions of the three MFs are different from those in Fig- 
ure 13.11(c). For comparison purposes, Figure 13.13 shows the results obtained by 
CANFIS with linear rules using four different MF types in accordance with the hy- 
brid learning algorithm. The four types are triangular MFs (trimf), trapezoidal MFs 
(trapmf), Gaussian MFs (gaussmf), and difference of two sigmoidal MFs (dsigmf). 
See Section 2.4 for their definitions. 

Without a priori knowledge, initial MF and rules’ consequent arrangements may 
not be perfect; trusting initial guesses may turn out to be obstacles to obtaining 
better results. Note that the resultant MFs in Figure 13.10(b) may not match any 
initial guess. 

We now assume a case in which we acquire two rules’ consequents from data sets. 
Is it a good idea to stick to them? Figure 13.14 shows results obtained using the 
two acquired rules’ consequents, which purposely coincide with the two side lines 
of the N-shape; Figure 13.14(b), resulting from 13.14(a), suggests that clinging to 
those two consequents may not be useful in obtaining a good fit. 

On the other hand, Figure 13.14(d) shows that CANFIS is able to adapt to 
the N-shape to some extent even when the initial MFs are poorly set up, as in 
Figure 13.14(c), where there is almost no intersection between the two modified 
bell MFs defined in Equation (13.4). [In Figure 13.14(d), the left-hand base of 
MF1 reached the pointed corner of the N-shape, and therefore the evolved MF1 
has a shape different from the MF1 in Figure 13.10(b).] 

Figure 13.13. Results of the N-shape problem obtained by CANFIS with linear rules 
using four different MF types based on the hybrid learning algorithm. The four types 
are triangular MFs (trimf), trapezoidal MFs (trapmf), Gaussian MFs (gaussmf), 
and difference of two sigmoidal MFs (dsigmf). The training RMSE (root mean 
squared error) and checking RMSE are also presented together as trn RMSE and 
chk RMSE. 

13.5.4 Evolution of Consequents (Rules) 

There must be some optimal combinations of “shapes” of MFs and “forms” of rules’ 
consequents. Figure 13.15 shows some of them. 

Interestingly enough, outputs of the adapted rules’ consequents depicted in 
Figures 13.15(a)-(d) end up being different and far from the desired N-shape. We have 
trained both antecedent and consequent parts simultaneously. Thus, each rule’s 
output does not have to fit the desired output, the N-shape; the final combined outputs 
fit it. 

Figure 13.14. Two results of the N-shape problem based on two fixed (i.e., 
manually tuned and then frozen) rules using original bell MFs (a),(b) and modified bell 
MFs (c),(d). Only MFs were trained. 


13.5.5 Evolving Partitions 


We have discussed both the truncation filter function and the modified sigmoidal
function in Section 13.3.3. In this section, we show the effectiveness of those
functions in training CANFIS and simple backpropagation MLPs, using a small
classification problem. Furthermore, we present how resulting CANFIS partitions
evolve through learning.

The data set consisted of 80 patterns in two-dimensional pattern space, as
illustrated in Figure 13.16. Those patterns had to be classified into four
categories; the patterns should be mapped into the following two-dimensional
vectors: (ON, ON) for class 1, (ON, OFF) for class 2, (OFF, ON) for class 3, and
(OFF, OFF) for class 4, where ON is set to 0.9 and OFF is set to 0.1.

Because we use a common sum of squared error measure and a soft-limiting
neuron function that has sigmoidal characteristics, we specially define the
following criterion so that the outputs, $O_i$ ($i = 1, 2$), can be regarded as
ON (0.9) or OFF (0.1):

$$
O_i \text{ is }
\begin{cases}
\text{classified to ON (0.9)} & \text{if } O_i > 0.8, \\
\text{classified to OFF (0.1)} & \text{if } O_i < 0.2, \\
\text{“undecided”} & \text{otherwise.}
\end{cases}
\tag{13.10}
$$
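As a minimal sketch of how Criterion (13.10) can be applied to a two-output model, the snippet below maps each output to ON, OFF, or "undecided" and then looks up the class from the target vectors given above. The function names `decide` and `classify` are hypothetical and introduced only for illustration; returning `None` for an undecided pattern is likewise an assumption.

```python
ON, OFF = 0.9, 0.1

# Target vectors for the four classes, as listed in the text.
TARGETS = {(ON, ON): 1, (ON, OFF): 2, (OFF, ON): 3, (OFF, OFF): 4}


def decide(output):
    """Apply Criterion (13.10) to one output; None means 'undecided'."""
    if output > 0.8:
        return ON
    if output < 0.2:
        return OFF
    return None


def classify(o1, o2):
    """Return the class (1-4) for outputs (o1, o2), or None if either is undecided."""
    return TARGETS.get((decide(o1), decide(o2)))
```

Under this reading, a pattern counts as correct only when both outputs are decided and match the target vector, which is consistent with "incorrect patterns" including "undecided patterns" in Table 13.3.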


Note that such classification problems may be treated in various ways, such as
by employing an error measure different from the sum of squared error measure [14],
by using hard-limiting activation functions, and so forth. A good discussion of
classification problems using the TLU (threshold logic unit) can be found in ref. [46].

The classification results are shown in Table 13.3; we tested the four presented
models: two CANFIS models with three MFs per input using nine linear rules
per output, and two simple backpropagation MLPs with 14 hidden units. All four
models have the same number of adjustable parameters, 72. The “stopped epoch”
shows when the models classified all 80 patterns correctly according to the
criterion defined in Equation (13.10). Training was performed using the steepest
descent method alone up to the preset iteration limit, 10,000.

Figure 13.15. Different outputs of optimized rules’ consequents after training
phase. They are generated by the four CANFIS models: CANFIS with linear
rules (a), sigmoidal rules (b), and neural rules (c) and (d). The corresponding
results are tabulated in Table 13.1.

The better results from both CANFIS_trc and NN_mod confirm that the
truncation filter function and the modified sigmoidal function help in learning given
mappings. (See the exercises at the end of this chapter for more discussion of these
results.) Note that the use of both the truncation filter and the modified sigmoidal
function is not confined to classification problems; another example can be seen
in Chapter 22.

Figure 13.16. A classification problem example: two-dimensional input space with
four classes.


Table 13.3. Results of the classification problem (depicted in Figure 13.16); two
CANFIS models and two backpropagation MLPs are compared. The values in this
table are averaged over seven trials. Our preset limit epoch was 10,000. Note that
“incorrect patterns” include “undecided patterns.”

Model        # of incorrect   Stopped     Neuron function at
             patterns         epoch       the output layer
-----------  ---------------  ----------  ----------------------------
CANFIS_trc         0             855.6    Truncation filter function
CANFIS_id         49.6        10,000.0    Identity function
NN_mod             0           1,993.3    Modified sigmoidal function
NN_norm            0           5,422.4    Normal sigmoidal function


Figure 13.17 depicts the initial MF setup and the resulting MF setup obtained 
at a stopped iteration. Figure 13.18 shows the initial partitionings and the resulting 
partitionings constructed by the obtained six MFs. Also, Figure 13.18 illustrates 
different constructed surfaces before and after MF normalization. 

The initial MFs seem to have excessive overlap to the extent that nine peaks
are vaguely recognizable in the initial output surfaces in Figure 13.18 (upper left).
Remember, however, that we usually have almost no chance of achieving perfect
initial MF arrangements; here, we have purposely shown how CANFIS learns from
poor MF setups. We can see that the constructed surfaces are very different after
normalization. Normalization can help to “see” beyond the range of the training
data, as suggested in Section 13.2.2. Note that an RBFN approach requires nine
basis functions to construct such nine peaks as in Figure 13.18 (upper left), as we
discussed in Figure 13.3.

Figure 13.17. Initial MFs and resulting MFs; initial setup of six MFs on x or
y before normalization (a) and after normalization (b); trained MFs on x before
normalization (c) and after normalization (d); trained MFs on y before
normalization (e) and after normalization (f).

13.6 SUMMARY 

We have studied in depth extended ANFIS ideas, exploiting various conceptual 
CANFIS architectures; CANFIS bears a close relationship to the computational 
paradigms of RBFNs and modular networks. As preparations for use in practical 
environments, it is worthwhile to investigate CANFIS’s strengths and weaknesses 
by using trivial examples. 

Particularly, we have described several problems faced in designing CANFIS in
light of the neuro-fuzzy spectrum. Automatic rule extraction is a useful aspect of
adaptive neuro-fuzzy models. Yet we should pay attention to the limitation behind
many successful reports: Acquired rules may sometimes be hard to understand.
Also, the initial number of MFs should be carefully determined so that fuzzy rules
can be held to meaningful limits. In this regard, it may be futile to pursue
interpretability from a fuzzy logic standpoint.

Figure 13.18. Comparison of output surfaces constructed by the six MFs before
normalization and after normalization; the initial output surfaces before
normalization (upper left) and after normalization (lower left); the final output
surfaces before normalization (upper right) and after normalization (lower right).

Although CANFIS may face such difficulties, most likely it will outperform
manually designed systems; the adaptive learning procedure helps. To obtain higher 
precision within a fixed amount of computation time and to limit the number of 
MFs so we can also retain interpretability, we may construct nonlinear rules such 
as neural rules. Or we may prefer to choose a more sophisticated MF, such as 
a generalized bell MF and an asymmetric MF. Moreover, manipulating neuron 
functions may have a beneficial effect on performance enhancement. The question 
is, what combinations of them are best? Exploring this idea further is a good small 
step toward finding an ideal model of a truly adaptive network. 

To investigate further the empirical observations discussed in this chapter, we 
shall apply this CANFIS modeling to a more realistic problem in Chapter 22 (see 
Figures 22.3 and 22.4).

Table 13.4. Experimental results in finding an optimal learning rate using an
MLP with modified sigmoidal functions or normal sigmoidal functions at the output
layer. The limit epoch was set to 10,000. Note that “incorrect patterns” include
“undecided patterns.”

            Modified sigmoidal function     Normal sigmoidal function
Learning    Stopped     # of incorrect      Stopped     # of incorrect
rate        epoch       patterns            epoch       patterns
--------    -------     --------------      -------     --------------
1.4          6,958            0             10,000            2
1.3          5,657            0              3,925            0
1.2          5,582            0             10,000           15
1.1          3,814            0              9,390            0
1.0          2,628            0              2,515            0
0.9          2,362            0              2,549            0
0.8          1,854            0              2,066            0
0.7          2,035            0              6,578            0
0.6          2,545            0              5,852            0
0.5          2,505            0              2,520            0
0.4          3,006            0             10,000            1
0.3          4,047            0             10,000            1
0.2          3,143            0              5,496            0
0.1         10,000            1             10,000            4
0.05        10,000           10             10,000           16


EXERCISES 

Concerning the classification problem discussed in Section 13.5.5, answer the 
following questions. 

1. When we introduce four-dimensional target vectors — (ON, OFF, OFF, OFF) 
for class 1, (OFF, ON, OFF, OFF) for class 2, (OFF, OFF, ON, OFF) for 
class 3, and (OFF, OFF, OFF, ON) for class 4 — what disadvantage will CAN- 
FIS suffer in contrast to simple backpropagation MLPs? 

2. Assuming that we explore various learning rates on the basis of pattern-by- 
pattern learning, we get the results shown in Table 13.4. (Deciding an optimal 
learning rate requires a rule of thumb; it is actually state-of-the-art.) What 
advantage can be gained by using an MLP with modified sigmoidal functions 
as opposed to an MLP with normal sigmoidal functions?

Table 13.5. Comparison of actual outputs of the four models (CANFIS_trc,
CANFIS_id, NN_mod, and NN_norm) for four patterns. Those sample outputs were
picked at their stopped epoch.

              Pattern #1          Pattern #2          Pattern #3          Pattern #4
-----------   -----------------   -----------------   -----------------   -----------------
Desired       0.9       0.9       0.9       0.1       0.1       0.9       0.1       0.1
CANFIS_trc    0.9       0.9       0.9       0.1       0.1       0.9       0.1       0.1
CANFIS_id     0.8788    0.8879    0.9654   -0.0757   -0.1303   -0.0006   -0.0755   -0.1830
NN_mod        0.9       0.9       0.9       0.12581   0.1       0.9       0.1       0.1
NN_norm       0.9980    0.9999    0.9183    0.1120    0.0744    0.8972    0.0005    0.0016


3. In Table 13.3, CANFIS_id (CANFIS with identity functions) shows poor
performance. Specify a possible reason, looking at the comparison of actual outputs
of the four models appearing in Table 13.5.

4. Suppose we get the RMSE [between the outputs $O_i$ ($i = 1, 2$) and the desired
outputs ON or OFF] result shown in Figure 13.19. Can we conclude, in light
of Criterion 13.10, that an MLP with modified sigmoidal functions is doing a
better job than an MLP with normal sigmoidal functions in classifying given
patterns? (If not, why not?)

Chapter 14

Classification and Regression Trees


J.-S. R. Jang 


14.1 INTRODUCTION 

We introduced the ANFIS (adaptive neuro-fuzzy inference system) network archi- 
tecture in Chapter 12, including its configuration, learning rules, and several appli- 
cation examples. However, the learning rules (or any other parameter-level adapta- 
tion methods) only deal with parameter identification; we still need methods for 
structure identification to determine an initial ANFIS architecture before any 
parameter-tuning procedures can take over. By having solid methods for both struc- 
ture and parameter identification, we are completing the cycle for fuzzy modeling 
introduced in Section 4.5.2. 

Structure identification in fuzzy modeling involves the following primary issues: 

• Selecting relevant input variables 

• Determining an initial ANFIS architecture, including 

1. Input space partitioning (see Section 4.5.1) 

2. Number of membership functions (MFs) for each input 

3. Number of fuzzy if-then rules 

4. Antecedent (premise) parts of fuzzy rules 

5. Consequent (conclusion) parts of fuzzy rules 

• Choosing initial parameters for MFs. 

In this chapter, we shall assume that the tree partition (see Section 4.5.1) has 
been adopted for our fuzzy modeling task. Based on the CART (classification and 
regression tree) algorithm [1], this chapter introduces a quick method for solving 
the problem of structure identification. The proposed method generates a tree
partitioning of the input space, which relieves the “curse of dimensionality” problem
(number of rules increasing exponentially with number of inputs) associated with 
the grid partitioning described in Section 4.5.1. Moreover, the resulting ANFIS 
architectures based on CART are more efficient in both training and application 
because of their implicit weight normalization. 

We start with a brief introduction to decision trees and the CART algorithm used 
to derive them. Following that, we explain how to transform CART-derived decision 
trees into efficient ANFIS network structures with implicit weight normalization. 

14.2 DECISION TREES 

A decision tree partitions the input space (also known as the feature or attribute 
space) of a data set into mutually exclusive regions, each of which is assigned a label, 
a value, or an action to characterize its data points. The decision tree mechanism 
is transparent and we can follow a tree structure easily to explain how a decision 
is made. Therefore, the decision tree method has been used extensively in machine 
learning, expert systems, and multivariate analysis; it is perhaps the most highly 
developed technique for partitioning sample data into a collection of decision rules. 

A decision tree is a tree structure consisting of internal and external nodes 
connected by branches. An internal node is a decision-making unit that evaluates 
a decision function to determine which child node to visit next. In contrast, an 
external node, also known as a leaf or terminal node, has no child nodes and 
is associated with a label or value that characterizes the given data that lead to 
its being visited. In general, a decision tree is employed as follows. First, we 
present a datum (usually a vector composed of several attributes or elements) to 
the starting node (or root node) of the decision tree. Depending on the result of a 
decision function used by an internal node, the tree will branch to one of the node’s 
children. This is repeated until a terminal node is reached and a label or value is 
assigned to the given input data. 

In the case of a binary decision tree, each internal node has exactly two children, 
so a decision can always be interpreted as either true or false. Of all decision trees, 
binary decision trees are the most often used because of their simplicity and our 
extensive knowledge of their characteristics. 

Decision trees used for classification problems are often called classification
trees, and each terminal node contains a label that indicates the predicted class of
a given feature vector. In the same vein, decision trees used for regression problems
are often called regression trees, and the terminal node labels may be constants
or equations that specify the predicted output value of a given input vector.
Figure 14.1(a) is a typical binary regression tree with two inputs $x$ and $y$ and one
output $z$. As shown in Figure 14.1(b), the decision tree partitions the input space
into four non-overlapping rectangular regions, each of which is assigned a label $f_i$
(which could be a constant or an equation) to represent a predicted output value.
Note that each terminal node has a unique path that starts with the root node
and ends at the terminal node; the path corresponds to a decision rule that is a
conjunction (AND) of various tests or conditions. For any given input vector, one
and only one path in the tree will be satisfied.

Figure 14.1. (a) A binary decision tree and (b) its input space partitioning.

Example 14.1 Input-output surfaces of regression trees 

When the labels of terminal nodes in a regression tree are constants, the resulting
input-output mapping looks like several constant-height planes put together with
sharp boundaries. Figure 14.2(a) is a typical example that shows the input-output
surface of the decision tree in Figure 14.1 with $a = 6$, $b = 3$, $c = 7$, $f_1 = 1$,
$f_2 = 3$, $f_3 = 5$, and $f_4 = 9$. On the other hand, if we assign a linear function of
the input variables to each terminal node, then the resulting surface is piecewise
linear, as shown in Figure 14.2(b), where $f_1 = 2x - y - 20$, $f_2 = -2x + 2y + 10$,
$f_3 = 6x - y + 5$, and $f_4 = 3x + 4y + 20$. Apparently, a regression tree is a very
easy-to-interpret representation of a nonlinear input-output mapping. However, the
discontinuity at the decision boundaries [say, $x = a$ in Figures 14.2(a) and 14.2(b)] is
unnatural and brings undesired effects to the overall regression and generalization.

□ 
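To make the traversal and the two kinds of terminal labels concrete, here is a minimal sketch that evaluates a two-level binary regression tree with the parameter values of Example 14.1. The tree layout is an assumption, since the figure itself is not reproduced here: the root is taken to test $x < a$, its left child to test $y < b$ (leading to $f_1$ or $f_2$), and its right child to test $y < c$ (leading to $f_3$ or $f_4$). The names `Leaf`, `Node`, `evaluate`, and `build_tree` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    value: Callable[[float, float], float]  # constant or equation in (x, y)


@dataclass
class Node:
    test: Callable[[float, float], bool]    # decision function of an internal node
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def evaluate(tree, x, y):
    """Follow the single satisfied path from the root to a terminal node."""
    node = tree
    while isinstance(node, Node):
        node = node.if_true if node.test(x, y) else node.if_false
    return node.value(x, y)


def build_tree(a, b, c, f1, f2, f3, f4):
    # Assumed layout: root splits on x, both children split on y.
    left = Node(lambda x, y: y < b, Leaf(f1), Leaf(f2))
    right = Node(lambda x, y: y < c, Leaf(f3), Leaf(f4))
    return Node(lambda x, y: x < a, left, right)


def const(v):
    return lambda x, y: v


constant_tree = build_tree(6, 3, 7, const(1), const(3), const(5), const(9))
linear_tree = build_tree(
    6, 3, 7,
    lambda x, y: 2 * x - y - 20,
    lambda x, y: -2 * x + 2 * y + 10,
    lambda x, y: 6 * x - y + 5,
    lambda x, y: 3 * x + 4 * y + 20,
)
```

Evaluating `linear_tree` just to the left and just to the right of $x = a$ at a fixed $y$ gives clearly different values, which is the discontinuity at the decision boundary that the example points out.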

Figure 14.2. Input-output surfaces of decision trees with terminal nodes
characterized by (a) constants and (b) linear equations in Example 14.1. (MATLAB
files: tsurf1.m and tsurf2.m)

Figure 14.3. (a) A typical tree $T$ with root node $t_1$ and a subtree $T_{t_3}$;
(b) $T - T_{t_3}$, the tree after shrinking the subtree $T_{t_3}$ into a terminal node $t_3$.

Before describing decision-tree induction in the next section, we must first
introduce some important nomenclature for binary trees. A typical binary tree, as
shown in Figure 14.3(a), is usually denoted as $T$ with the root node $t_1$. A generic
node in $T$ is denoted by $t$, and the subtree with $t$ as a root node is usually
represented by $T_t$, as shown in Figure 14.3(a), where $t = t_3$. We use $\tilde{T}$ to
denote the set of terminal nodes in a tree $T$; the number of terminal nodes is thus
represented by $|\tilde{T}|$ [which is equal to 5 in the case of Figure 14.3(a)]. It is easy
to prove that in a complete binary tree (where each node has zero or two children),
the number of terminal nodes is always one more than the number of internal nodes.
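The node-count property of complete binary trees stated in the nomenclature can be checked with a short counting argument. The symbols $n_I$ and $n_T$ below are assumed names for the numbers of internal and terminal nodes; they are not used elsewhere in the text.

```latex
% n_I = number of internal nodes, n_T = number of terminal nodes (names assumed here)
\begin{align*}
\text{number of nodes} &= n_I + n_T, \\
\text{number of nodes} &= 2\,n_I + 1
  && \text{(every internal node has two children; only the root is not a child)}, \\
\Longrightarrow \quad n_T &= n_I + 1 .
\end{align*}
```

For a complete binary tree with five terminal nodes, this gives four internal nodes.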


An example of tree pruning is shrinking the subtree $T_{t_3}$ in Figure 14.3 into a
terminal node. The tree after pruning, denoted as $T - T_{t_3}$, is a subset of the
original tree, and this is usually expressed as

$$T - T_{t_3} \subset T.$$

14.3 CART ALGORITHM FOR TREE INDUCTION 

The use of tree-based classification and regression dates back to the AID (Automatic
Interaction Detection) program of Morgan and Sonquist [3]. Methods of decision-tree
induction from sample data, also known as recursive partitioning, have since
been an active topic in artificial intelligence (particularly machine learning) and
statistics (in particular, multivariate analysis) research communities. In machine
learning literature, the most representative methods of decision-tree induction are
the ID3 [4] and C4 [5] procedures proposed by Quinlan; detailed treatments can
be found in ref. [6]. Similar problems were approached by statisticians at about the
same time, and the most well-known work was published by Breiman et al. [1] in
their monograph entitled Classification and Regression Trees; thus the methodology
is often referred to as the CART algorithm. The fundamentals of ID3 and CART
are similar; the major distinction is that CART induces strictly binary trees and uses
resampling techniques for error estimation and tree pruning, while ID3 partitions
according to attribute values.

This section presents a summary of the CART procedure, and we confine our 
scope to the discussion of binary trees only; the extension to n-ary trees is straight- 
forward. The material presented in this section will serve as a background for 
understanding the use of CART for structure identification in ANFIS in the next 
section. 

To construct an appropriate decision tree, CART first grows the tree extensively 
based on a sample (training) data set, and then prunes the tree back based on a 
minimum cost-complexity principle [1]. The result is a sequence of trees of various 
sizes; the final tree selected is the tree that performs best when another independent 
(checking or test) data set is presented. In summary, the CART procedure consists 
of two parts: tree growing and tree pruning. 

14.3.1 Tree Growing 

CART grows a decision tree by determining a succession of splits (decision bound- 
aries) that partition the training data into disjoint subsets. Starting from the root 
node that contains all the training data, an exhaustive search is performed to find 
the split that best reduces an error measure (or cost function). Once the best split 
is determined, the data set is partitioned into two disjoint subsets accordingly; the 
subsets are represented by two child nodes originating from the root node, and the
same splitting method is applied to both child nodes. This recursive procedure ter- 
minates either when the error measure associated with a node falls below a certain 
tolerance level, or when the error reduction resulting from further splitting will not 
exceed a certain threshold value. 

Classification Trees 

Classification trees are used to solve classification problems in which attributes
of an object are used to determine what class the object belongs to. To grow
a classification tree, we need to have an error measure $E(t)$ that quantifies the
performance of a node $t$ in separating data (or cases) from different classes. The
error measure for classification trees is often referred to as the impurity function;
for a given node (or equivalently, for a data set), it should attain a minimum at zero
when the given data all belong to the same class, and reach a maximum when the
data are evenly distributed through all possible classes. A formal definition of the
impurity function for $J$-class problems is given next.

Definition 14.1 The impurity function for J-class problems 

The impurity function $\phi$ is a $J$-place function that maps its input arguments
$p_1, p_2, \ldots, p_J$, with $\sum_{j=1}^{J} p_j = 1$, into a non-negative real number, such that

$$
\begin{aligned}
\phi(1/J, 1/J, \ldots, 1/J) &= \text{maximum}, \\
\phi(1, 0, 0, \ldots, 0) = \phi(0, 1, 0, \ldots, 0) = \phi(0, 0, 0, \ldots, 1) &= 0.
\end{aligned}
\tag{14.1}
$$

The input argument $p_j$, $j = 1$ to $J$, is the probability that a case in a node belongs
to class $j$. Therefore, the impurity function for a given node is largest when all
classes are equally mixed in the node, and is smallest when the node contains cases
from only one class.


By using the impurity function $\phi$, the impurity measure of a node $t$ is expressed
as

$$E(t) = \phi(p_1, p_2, \ldots, p_J),$$

where $p_j$ is the percentage of cases in node $t$ that belong to class $j$. Similarly, the
impurity measure of a tree $T$ can be expressed as

$$E(T) = \sum_{t \in \tilde{T}} E(t),$$

where $\tilde{T}$ is the set of terminal nodes in tree $T$.

The best known impurity functions for a $J$-class classification tree are the
entropy function and the Gini diversity index [1].


$$
\begin{aligned}
\text{Entropy function:} \quad & \phi_e(p_1, \ldots, p_J) = -\sum_{j=1}^{J} p_j \ln p_j, \\
\text{Gini index:} \quad & \phi_g(p_1, \ldots, p_J) = \sum_{i \neq j} p_i p_j = 1 - \sum_{j=1}^{J} p_j^2.
\end{aligned}
\tag{14.2}
$$

Since $\sum_{j=1}^{J} p_j = 1$ and $0 \le p_j \le 1$ for all $j$, the preceding two functions are always
positive unless one of $p_j$ is unity and all the others are zero. Moreover, they reach
their maxima when $p_j = 1/J$ for all $j$. The proof of these properties is left as an
exercise.
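The two impurity functions of Equation (14.2) are easy to evaluate numerically. The following is a small sketch; the function names `entropy` and `gini` are hypothetical, and terms with $p_j = 0$ are skipped in the entropy sum under the usual convention $0 \ln 0 = 0$ (an assumption that the text does not state explicitly).

```python
import math


def entropy(p):
    """Entropy impurity: -sum_j p_j ln p_j, with 0 ln 0 taken as 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)


def gini(p):
    """Gini diversity index: 1 - sum_j p_j^2."""
    return 1.0 - sum(pj * pj for pj in p)
```

For example, `entropy([0.5, 0.5])` equals $\ln 2$ and `gini([0.5, 0.5])` equals $1/2$, the two-class maxima, while both functions return zero for a pure node such as `[1.0, 0.0]`.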


Example 14.2 Entropy and Gini function visualization 

Both the entropy function and the Gini index can be expressed as functions of $p_j$,
$j = 1$ to $J - 1$, if we plug $p_J = 1 - \sum_{j=1}^{J-1} p_j$ into Equation (14.2). In other words,
both impurity functions can be visualized as curves when $J$ is 2 and as surfaces
when $J$ is 3, as shown in Figure 14.4. When $J$ is 3, neither function is defined
for $p_1 + p_2 > 1$, since $p_3 = 1 - p_1 - p_2$ should always be kept non-negative; this is
clearly shown in Figures 14.4(b) and 14.4(d).

Figure 14.4. Impurity functions for classification trees: (a) entropy function for
two-class problem; (b) entropy function for three-class problem; (c) Gini index for
two-class problem; (d) Gini index for three-class problem. Both the curves in (a)
and (c) reach their maxima at $p_1 = p_2 = 1/2$; both the surfaces in (b) and (d) reach
their maxima at $p_1 = p_2 = p_3 = 1/3$. (MATLAB file: impurity.m)


□ 

Given an impurity function for computing the cost of a node, the tree-growing
procedure tries to find an optimal way to split the cases (or objects) in the node
such that the cost reduction is the greatest. In a binary tree, the impurity change
due to splitting is

$$\Delta E(s, t) = E(t) - p_l E(t_l) - p_r E(t_r), \tag{14.3}$$

where $t$ is the node being split; $E(t)$ is the impurity of the current node $t$; $E(t_l)$ and
$E(t_r)$ are the impurities of the left and right branch nodes; and $p_l$ and $p_r$ are the
percentages of cases in node $t$ that branch left and right, respectively. In symbols,
the tree-growing procedure tries to find a split $s^*$ for the root node $t_1$ such that the
split gives the largest decrease in impurity:

$$\Delta E(s^*, t_1) = \max_{s \in S} \Delta E(s, t_1),$$

where $S$ is a set of all the possible ways of splitting the cases in node $t_1$. By using
the optimal $s^*$, $t_1$ is split into $t_2$ and $t_3$, and the same search procedure for the best
$s \in S$ is repeated on both $t_2$ and $t_3$ separately, and so on.

So far we have supposed that the inputs or attributes under consideration are
numerical or ordered variables that assume numerical values; examples of this
kind of variable include temperatures, heights, lengths, and so on. For binary trees,
a typical split (or question) for a numerical variable $x$ takes the following form:

Is $x < s_i$?

Usually the split value $s_i$ is the average of the $x$ values of two data points that are
adjacent in terms of their $x$ coordinates alone. For a data set of size $M$, the number
of candidate splits for a numerical variable is less than or equal to $M - 1$.

For categorical variables that assume labels with no natural ordering, the
tree-growing procedure given here is still applicable except that splitting a node
depends on how to put the possible labels of a variable into two disjoint sets.
Therefore, for a binary tree, a typical split (question) takes the following form:

Is $x$ in $S_1$?

The set $S_1$ is a non-empty proper subset of $S$, the set of all possible labels of variable
$x$. To eliminate duplication due to symmetry, the size of $S_1$ is usually less than or
equal to half the size of $S$. In general, a categorical variable $x$ with $k$ possible labels
has $(2^k - 2)/2 = 2^{k-1} - 1$ candidate splits for this variable. (Why?)

Example 14.3 Splits for numerical and categorical variables 

If the $x$ attributes of a data set are represented as a set $\{1, 2, 4, 7\}$, then the candidate
split values are $\{\frac{1+2}{2}, \frac{2+4}{2}, \frac{4+7}{2}\} = \{1.5, 3, 5.5\}$. In contrast, if a categorical
variable status assumes four possible labels {single, married, divorced, widowed},
then we have seven candidate splits that divide a data set into two disjoint
non-empty subsets based on this variable.


□ 
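The candidate splits of Example 14.3 can be enumerated directly. The sketch below uses hypothetical function names. Note one design choice that differs from the text: instead of limiting the size of $S_1$ to remove mirror-image duplicates, it always keeps the first label (in sorted order) on the left side, which yields the same count of $2^{k-1} - 1$ distinct splits.

```python
from itertools import combinations


def numerical_splits(values):
    """Midpoints between adjacent distinct values of a numerical attribute."""
    xs = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]


def categorical_splits(labels):
    """All ways to divide the labels into two disjoint non-empty sets."""
    labels = sorted(set(labels))
    first, rest = labels[0], labels[1:]
    splits = []
    # Pin `first` to the left set so that each split is listed only once;
    # r stops short of len(rest) so the right set is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(labels) - left
            splits.append((left, right))
    return splits
```

With the data of the example, `numerical_splits([1, 2, 4, 7])` gives the three midpoints and `categorical_splits` applied to the four status labels gives seven splits.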

The following example demonstrates how to split a node containing five cases, 
each of which has two numerical attributes. 

Example 14.4 Node-splitting for classification trees 

Suppose that we want to split a node $t$ containing five data points of two attributes
$x$ and $y$, as shown in Figure 14.5, where data from classes 1 and 2 are denoted
by crosses and circles, respectively. Apparently, there is a total of eight possible
splits, denoted by $s_i$ in Figure 14.5. If we choose the entropy function as the error
measure, the impurity of this node is

$$E(t) = -\tfrac{2}{5} \ln\left(\tfrac{2}{5}\right) - \tfrac{3}{5} \ln\left(\tfrac{3}{5}\right) = 0.6730.$$

Now we have to evaluate the change of impurity due to each split. For instance, for
the split of $s_2$, we have

$$
\begin{aligned}
p_l &= \tfrac{2}{5}, & E(t_l) &= -\tfrac{1}{2} \ln\left(\tfrac{1}{2}\right) - \tfrac{1}{2} \ln\left(\tfrac{1}{2}\right) = 0.6931, \\
p_r &= \tfrac{3}{5}, & E(t_r) &= -\tfrac{1}{3} \ln\left(\tfrac{1}{3}\right) - \tfrac{2}{3} \ln\left(\tfrac{2}{3}\right) = 0.6365.
\end{aligned}
$$

Therefore, the change of impurity due to split $s_2$ is

$$\Delta E(s_2, t) = E(t) - \tfrac{2}{5} E(t_l) - \tfrac{3}{5} E(t_r) = 0.0138.$$

Apparently, this is not a very effective split. Following the same procedure, we can
get a list of all split performances:

$$
\begin{aligned}
\Delta E(s_1, t) &= 0.2231, \\
\Delta E(s_3, t) &= 0.2911, \\
\Delta E(s_4, t) &= 0.1185, \\
\Delta E(s_5, t) &= 0.2231, \\
\Delta E(s_6, t) &= 0.6730, \\
\Delta E(s_7, t) &= 0.2911, \\
\Delta E(s_8, t) &= 0.1185.
\end{aligned}
$$

Therefore, the best split is $s_6$, which separates the data most effectively and reduces
the impurity to zero.


□ 
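The arithmetic of Example 14.4 can be reproduced from class counts alone using Equation (14.3) with the entropy function. In the sketch below, the function names are hypothetical, and the per-side class counts in the usage note are assumptions: the figure is not available here, so the counts were chosen as the combinations of a five-point node (two points of one class, three of the other) that reproduce the values listed in the example.

```python
import math


def entropy_from_counts(counts):
    """Entropy impurity of a node given the number of cases per class."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)


def impurity_change(left_counts, right_counts):
    """Delta E(s, t) = E(t) - p_l E(t_l) - p_r E(t_r), as in Equation (14.3)."""
    parent = [l + r for l, r in zip(left_counts, right_counts)]
    n = sum(parent)
    p_l = sum(left_counts) / n
    p_r = sum(right_counts) / n
    return (entropy_from_counts(parent)
            - p_l * entropy_from_counts(left_counts)
            - p_r * entropy_from_counts(right_counts))
```

For instance, `impurity_change((1, 1), (1, 2))` evaluates to about 0.0138, matching the split $s_2$, and `impurity_change((2, 0), (0, 3))` evaluates to about 0.6730, the perfect separation achieved by the best split.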


Another criterion, called the twoing rule, is to select a split that minimizes 

3 

where $P_j(t_l)$ and $P_j(t_r)$ are the probabilities of a data point in class $j$, given that
the data set comes from the left and right children, respectively. The justification
of the twoing rule can be found in the CART monograph [1].

Figure 14.5. Node-splitting in classification trees (o: class 1, x: class 2); see
Example 14.4. (MATLAB file: splits.m)


Regression Trees 

We use regression trees to solve regression problems where attributes of an object are
used to determine one or more numerical attributes of the object. For a regression
tree, the error measure of a node $t$ is usually taken as the squared error, or residual,
of a local model employed to fit the data set of the node:

$$E(t) = \min_{\theta} \sum_{i=1}^{N(t)} \left( y_i - d_t(x_i, \theta) \right)^2, \tag{14.4}$$

where $\{x_i, y_i\}$ is a typical data point, $N(t)$ is the number of data points in node $t$,
$d_t(x, \theta)$ is a local model (with modifiable parameter $\theta$) for node $t$, and $E(t)$ is the
mean-squared error of fitting the local model $d_t$ to the data set in the node. If
$d(x, \theta) = \theta$ is a constant function independent of $x$, then the minimizing $\theta$ of the
preceding error measure is the average value of the desired output $y_i$ for the node;
that is, $\theta^* = \frac{1}{N(t)} \sum_{i=1}^{N(t)} y_i$. Similarly, if $d(x, \theta)$ is a linear model with linear
parameters $\theta$, then we can always use the least-squares methods introduced in
Chapter 5 to identify the minimizing $\theta^*$ and $E(t)$ for a given node $t$.

For any split $s$ of node $t$ into $t_l$ and $t_r$, the change in error measure is expressed
as

$$\Delta E(s, t) = E(t) - E(t_l) - E(t_r). \tag{14.5}$$

The best split $s^*$ is the one that maximizes the decrease in the error measure:

$$\Delta E(s^*, t) = \max_{s \in S} \Delta E(s, t).$$
The strategy for growing a regression tree is to split nodes (or data set) iteratively
and thus maximize the decrease in $E(T) = \sum_{t \in \tilde{T}} E(t)$, the overall error measure (or
cost) of the tree. Therefore, the goal of growing either a classification or regression
tree is the same: to split nodes (or, equivalently, partition data set or input space)
recursively and thus minimize a given reasonable error measure in a greedy, single
look-ahead manner.

Figure 14.6. Growing a CART tree. First column: target surface; second column:
regression tree surface with 10 rules or terminal nodes. (MATLAB file: go_cart.m)
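The greedy growth of a regression tree can be sketched for the simplest setting: a single numerical input and a constant local model per node, so that $E(t)$ in Equation (14.4) is the sum of squared deviations from the node mean and the split gain follows Equation (14.5). This is only an illustrative sketch under those assumptions, not the CART implementation; the names `sse`, `best_split`, `grow`, and the stopping threshold `min_gain` are hypothetical.

```python
def sse(ys):
    """E(t) for a constant local model: squared error around the node mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Exhaustive search over midpoints; returns (threshold, gain) or None."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs_sorted = [xs[i] for i in order]
    ys_sorted = [ys[i] for i in order]
    parent = sse(ys_sorted)
    best = None
    for k in range(1, len(xs_sorted)):
        if xs_sorted[k] == xs_sorted[k - 1]:
            continue  # no threshold fits between equal x values
        threshold = (xs_sorted[k - 1] + xs_sorted[k]) / 2
        gain = parent - sse(ys_sorted[:k]) - sse(ys_sorted[k:])  # Eq. (14.5)
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


def grow(xs, ys, min_gain=1e-9):
    """Recursively split until the error reduction no longer exceeds min_gain."""
    split = best_split(xs, ys)
    if split is None or split[1] <= min_gain:
        return {"value": sum(ys) / len(ys)}  # terminal node: node average
    s, _ = split
    left = [(x, y) for x, y in zip(xs, ys) if x < s]
    right = [(x, y) for x, y in zip(xs, ys) if x >= s]
    return {"split": s,
            "left": grow(*zip(*left)),
            "right": grow(*zip(*right))}
```

Each call looks only one split ahead, which is the "single look-ahead" behavior described above; nothing guarantees that the resulting tree is globally optimal.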

Example 14.5 Growing a regression tree 

Figures 14.6 and 14.7 demonstrate the progression of growing a regression tree to 
match the “peaks” function defined by Equation (7.1) in Chapter 7. The input- 
output surfaces of the regression tree are snapshots when the number of terminal 
nodes is equal to 10, 20, and 30, respectively. The boundary plots clearly indicate 
the square local region governed by each terminal node (or rule). 


□ 


14.3.2 Tree Pruning

The tree that the preceding growing procedure yields is often too large, and it is
biased toward the training data set. Thus, it places an unreliably high degree of
accuracy on reproducing desired outputs from the training data. In other words,
we may encounter the familiar problem of overfitting and overspecializing toward
the training data, and the tree may not generalize well for new cases.

Figure 14.7. Growing a CART tree (continued). First column: regression tree
surface with 20 rules (or terminal nodes); second column: regression tree surface
with 30 rules (or terminal nodes). (MATLAB file: go_cart.m)

There are several methods to find the tree size that gives a better estimate of 
the true error measure. One of the most effective methods is based on the principle 
of minimum cost-complexity or weakest-subtree shrinking. The first step is 
to grow a fully expanded tree Tmax that has a fairly low apparent error measure 
based on the training data set. This tree is usually too large, and we want to prune 
it back consistently by finding the weakest subtree in it. The weakest subtree is 
found by considering both the training error measure and the number of terminal 
nodes, which is considered a measure of the tree’s complexity. 

Definition 14.2 Cost-complexity measure of decision trees [1]

For any subtree $T \subset T_{max}$, define its complexity as $|T|$, the number of terminal
nodes in $T$. Then the cost-complexity measure $E_\alpha(T)$ is defined by

$$E_\alpha(T) = E(T) + \alpha |T|, \tag{14.6}$$

where $\alpha$ is a complexity parameter that accounts for the cost due to the tree’s
complexity. Thus $E_\alpha(T)$ is a linear combination of the cost of the tree and its
complexity.


□ 

For each value of $\alpha$, we can find a minimizing subtree $T(\alpha)$ with respect to the
cost-complexity measure:

$$E_\alpha(T(\alpha)) = \min_{T \subset T_{max}} E_\alpha(T).$$

If $T(\alpha)$ is a minimizing tree for a given value of $\alpha$, then it continues to be
minimizing as $\alpha$ increases until a jump point $\alpha'$ is reached and a new tree $T(\alpha')$
becomes the new minimizing tree.

Suppose that $T_{max}$ has $L$ terminal nodes. The idea of progressive upward tree
pruning is to find a sequence of smaller and smaller trees $T_L, T_{L-1}, T_{L-2}, \ldots,$ and
$T_1$ that satisfies

$$\{t_1\} = T_1 \subset T_2 \subset \cdots \subset T_{L-2} \subset T_{L-1} \subset T_L = T_{max},$$

where $T_i$ has $i$ terminal nodes. Each $T_{i-1}$ is obtained from $T_i$ as the first minimizing
subtree of the cost-complexity measure as $\alpha$ increases from zero.

To find the next minimizing tree for a tree $T$, we proceed as follows. For each
internal node $t$ in $T$, we first find a value for $\alpha$ that makes $T - T_t$ the next minimizing
tree; this value of $\alpha$, denoted by $\alpha_t$, is equal to the ratio between the change in error
measures and the change in the number of terminal nodes before and after shrinking:

$$\alpha_t = \frac{E(t) - E(T_t)}{|T_t| - 1}.$$

Then we choose the internal node with the smallest $\alpha_t$ as the target node for
shrinking. Therefore, a tree-pruning cycle consists of the following tasks:

1. Calculate $\alpha_t$ for each internal node $t$ in $T$.

2. Find the minimal $\alpha_t$ and choose $T - T_t$ as the next minimizing tree.

This process is repeated until the tree contains a single root node. Figure 14.3
demonstrates an example of tree pruning by shrinking the subtree $T_{t_3}$ (with the
internal node $t_3$ as its root) into a terminal node. The tree after pruning, denoted
as $T - T_{t_3}$, is a subset of the original tree.
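The pruning cycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the helper names, and the error values in `example_tree` are hypothetical and do not come from the text; each node simply stores $E(t)$, the error it would have as a terminal node.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; `error` is E(t), the node's error if it were terminal."""
    error: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_terminal(self) -> bool:
        return self.left is None and self.right is None


def leaves(t: Node) -> List[Node]:
    """Terminal nodes of the subtree rooted at t."""
    return [t] if t.is_terminal() else leaves(t.left) + leaves(t.right)


def internal_nodes(t: Node) -> List[Node]:
    """Non-terminal nodes of the subtree rooted at t (t included)."""
    if t.is_terminal():
        return []
    return [t] + internal_nodes(t.left) + internal_nodes(t.right)


def alpha(t: Node) -> float:
    """alpha_t = (E(t) - E(T_t)) / (|T_t| - 1) for an internal node t."""
    terminal = leaves(t)
    subtree_error = sum(leaf.error for leaf in terminal)  # E(T_t)
    return (t.error - subtree_error) / (len(terminal) - 1)


def pruning_sequence(root: Node) -> Tuple[List[int], List[float]]:
    """Shrink the weakest subtree repeatedly (in place) until only the root is left.

    Returns the tree sizes visited and the alpha_t chosen in each cycle.
    """
    sizes, alphas = [len(leaves(root))], []
    while not root.is_terminal():
        weakest = min(internal_nodes(root), key=alpha)  # tasks 1 and 2
        alphas.append(alpha(weakest))
        weakest.left = weakest.right = None  # shrink T_t into the terminal node t
        sizes.append(len(leaves(root)))
    return sizes, alphas


def example_tree() -> Node:
    """Made-up error values, chosen only so that the example has no ties."""
    return Node(10.0,
                Node(4.0, Node(1.0), Node(1.0)),
                Node(5.0, Node(2.0), Node(2.2, Node(1.0), Node(1.0))))


sizes, alphas = pruning_sequence(example_tree())
print(sizes)                          # [5, 4, 3, 1]
print([round(a, 3) for a in alphas])  # [0.2, 0.8, 1.5]
```

In this made-up tree the last cycle shrinks a subtree with three terminal nodes, so the size drops from 3 directly to 1; candidate sizes can therefore skip values whenever the weakest subtree has more than two terminal nodes.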

By repeating the pruning process, a series of candidate trees can be obtained
by shrinking each weakest subtree sequentially, where each shrinkage results in a
minimal increase in the value of $\alpha$ in proceeding toward the next minimizing tree.
The problem has now been reduced to selecting one of these candidate trees as
the optimum-sized tree. There are two general methods for doing this: using an
independent test (checking) data set and performing cross-validation. Of the two,
use of a test data set is computationally simpler, but cross-validation makes more
effective use of all available data. To use an independent test data set, we simply
pick the tree that generates the smallest error measure when the test data set is
presented. The cross-validation method is more complicated, and the reader is
referred to the CART monograph [1] for a complete treatment of it. Figure 14.8
shows a typical pattern of tree error measures versus tree sizes for the candidate
trees $T_1, T_2, \ldots,$ and $T_L$, with $L = 20$, obtained from the preceding tree-pruning
procedure. As the tree’s complexity (that is, number of terminal nodes) increases,
the training error decreases, reaching zero when the tree is fully expanded. In
contrast, the checking error decreases initially, reaches a minimum, and then
increases gradually due to the tree’s overspecialization to the training data. Without
any other a priori information, we usually take the checking error as a true unbiased
estimate of the real error measure and take the corresponding tree as the
optimum-sized tree.

[Figure 14.8 plot; horizontal axis: Number of Terminal Nodes]

Figure 14.8. Error measures with respect to tree sizes. The jump from $T_8$ to
the next smaller candidate tree indicates that the shrunken subtree in $T_8$ has more
than two terminal nodes. (MATLAB file: carterr.m)
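Selecting the optimum-sized tree with an independent checking set amounts to a simple minimum search over the candidate trees. The sketch below assumes the checking errors have already been computed for each candidate; the function name and all numbers are made up for illustration and only mimic the typical pattern described above.

```python
from typing import Sequence


def pick_optimum_size(sizes: Sequence[int], checking_errors: Sequence[float]) -> int:
    """Return the size of the candidate tree with the smallest checking error.

    Ties go to the first entry, so list the candidates from smallest to largest
    if the smaller tree should win.
    """
    best = min(range(len(sizes)), key=lambda i: checking_errors[i])
    return sizes[best]


# Made-up numbers: training error keeps falling, checking error turns back up.
sizes = [1, 2, 4, 6, 8, 12, 20]
training_errors = [9.0, 6.5, 4.0, 2.8, 2.0, 0.9, 0.0]
checking_errors = [9.4, 7.1, 4.9, 4.1, 4.3, 4.8, 5.6]
print(pick_optimum_size(sizes, checking_errors))  # 6
```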

14.4 USING CART FOR STRUCTURE IDENTIFICATION IN 
ANFIS 

The CART algorithm is a powerful nonparametric method with the following fea- 
tures: 

• Conceptual simplicity

• Computational efficiency

• Applicability to classification and regression problems

• Solid statistical foundation

• Suitability for high-dimensional data

• Ability to identify relevant inputs simultaneously.

In this section, we describe how to use CART for structure identification in 
ANFIS. That is, we use CART to find the number of ANFIS rules and the ini- 
tial locations of membership functions (MFs) before training. For simplicity, we 
confine our scope to regression problems; a similar approach can be also used for 
classification problems. 

To construct a regression tree with constant-output terminal nodes [see Ex- 
ample 14.1 and Figure 14.2(a)], the CART algorithm described earlier can always 
identify a right-size tree and determine irrelevant inputs not required by the tree. 
On the other hand, if the terminal nodes are characterized by linear equations [see 
Example 14.1 and Figure 14.2(b)], more computation is necessary to find relevant 
inputs. One way to reduce the computation burden is to employ the least-squares 
estimator (LSE) introduced in Chapter 5; of particular importance is the LSE that 
is obtained recursively in accommodating new data and new parameters. 

It is obvious that the decision tree in Figure 14.1 is equivalent to a set of crisp
rules:

If $x < a$ and $y < b$, then $z = f_1$.
If $x < a$ and $y \ge b$, then $z = f_2$.
If $x \ge a$ and $y < c$, then $z = f_3$.
If $x \ge a$ and $y \ge c$, then $z = f_4$.

(14.7)


Given an input vector $[x, y]$, only a single rule out of the four will be fired at full
strength, while the other three will not be activated at all. This crispness reduces the
computation required to construct the tree using CART, but it also gives undesirable
discontinuous boundaries in the overall input-output mapping. To smooth out the
discontinuity at each split, a natural option is to use fuzzy sets to represent the
premise parts of the rules in Equation (14.7), thus converting Equation (14.7) into
a set of Sugeno-style fuzzy if-then rules, as described in Chapter 4. The resultant
Sugeno fuzzy inference model can be of zero order if the $f_i$’s are constants, or first
order if the $f_i$’s are linear equations.

To fuzzify the premise part, the statement $y \ge c$ can be represented as a fuzzy
set characterized by, for instance, the sigmoidal MF introduced in Section 2.4:

$$\mu_{y \ge c}(y; \alpha) = \mathrm{sig}(y; \alpha, c) = \frac{1}{1 + \exp[-\alpha(y - c)]}, \tag{14.8}$$

[Figure 14.9 panel title: (b) S MFs]

Figure 14.9. Two types of MFs for $x \ge c$ (where $c = 5$): (a) sigmoidal MF with
different $\alpha$’s; (b) extended S MF with different $\gamma$’s. (MATLAB file: cartmf.m)


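where $\alpha$ and $c$ are modifiable parameters for the MF. Similarly, we can also employ
the S MF (see page 43) to represent the meaning of $y \ge c$:

$$\mu_{y \ge c}(y; w) = S(y; c - w, c + w) =
\begin{cases}
0, & \text{if } y \le c - w, \\
\frac{1}{2}\left[\frac{y - (c - w)}{w}\right]^{2}, & \text{if } c - w < y \le c, \\
1 - \frac{1}{2}\left[\frac{c + w - y}{w}\right]^{2}, & \text{if } c < y \le c + w, \\
1, & \text{if } c + w < y.
\end{cases} \tag{14.9}$$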

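To increase the degree of freedom, we can even use an extended S MF with an
extra parameter $\gamma$:

$$\mu_{y \ge c}(y; w, \gamma) = S_{ext}(y; c - w, c + w, \gamma) =
\begin{cases}
0, & \text{if } y \le c - w, \\
\frac{1}{2}\left[\frac{y - (c - w)}{w}\right]^{2\gamma}, & \text{if } c - w < y \le c, \\
1 - \frac{1}{2}\left[\frac{c + w - y}{w}\right]^{2\gamma}, & \text{if } c < y \le c + w, \\
1, & \text{if } c + w < y.
\end{cases} \tag{14.10}$$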

(Note that when $\gamma = 0.5$, the preceding extended S MF becomes a piecewise linear
function.) Figure 14.9 shows the sigmoidal and the extended S MFs for the linguistic
term $y \ge c$; Figure 14.10 is the fuzzy version of the surface plots in Figure 14.2,
where the extended S MF (with $w = 1$ and $\gamma = 1$) is used to define the meaning of
$\ge$. Remember that when $\alpha \to \infty$ in the sigmoidal MF, or when $w = 0$ or $\gamma \to \infty$
in the extended S MF, both MFs reduce to the step function and the fuzzy rules
reduce to the original crisp rules.
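As a quick numerical check of these membership functions, the following Python sketch evaluates the sigmoidal MF and the extended S MF for a split at $c = 5$. The function names and the sample parameter values are illustrative assumptions, not part of the text.

```python
import math


def sig_mf(y: float, alpha: float, c: float) -> float:
    """Sigmoidal MF: 1 / (1 + exp(-alpha * (y - c)))."""
    return 1.0 / (1.0 + math.exp(-alpha * (y - c)))


def s_ext_mf(y: float, c: float, w: float, gamma: float) -> float:
    """Extended S MF on [c - w, c + w]; gamma = 1 gives the ordinary S MF."""
    if y <= c - w:
        return 0.0
    if y <= c:
        return 0.5 * ((y - (c - w)) / w) ** (2.0 * gamma)
    if y <= c + w:
        return 1.0 - 0.5 * ((c + w - y) / w) ** (2.0 * gamma)
    return 1.0


c = 5.0
# gamma = 0.5 is piecewise linear: the grade rises by 0.25 per 0.5 step in y.
print([s_ext_mf(y, c, w=1.0, gamma=0.5) for y in (4.0, 4.5, 5.0, 5.5, 6.0)])
# gamma = 1 is the smooth S shape.
print([s_ext_mf(y, c, w=1.0, gamma=1.0) for y in (4.0, 4.5, 5.0, 5.5, 6.0)])
# A large alpha makes the sigmoid behave almost like a step at c.
print(sig_mf(4.0, 100.0, c), sig_mf(6.0, 100.0, c))
```

With $w = 0$ the two middle branches are never reached in this implementation, so `s_ext_mf` returns only 0 or 1, which matches the crisp limit mentioned above.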

Based on the fuzzy version of the rules in Equation (14.7), we can derive another
class of adaptive network for identifying the premise and consequent parameters of
the underlying fuzzy inference system. This ANFIS architecture is depicted in
Figure 14.11. Layer 1 calculates the membership grades of given input variables (INV
nodes represent negation operators); layer 2 multiplies the given membership grades
to find the firing strength of each rule; layer 3 computes the contribution of each
rule based on given firing strengths; and layer 4 finds the summation of incoming
signals, which is equal to the overall output of this fuzzy inference system. Premise
and consequent parameters are contained in layers 1 and 3, respectively; these
parameters are fine-tuned according to the fast hybrid learning rules introduced in
Section 12.3, or any of the other nonlinear parameter identification methods
introduced in Section 6.8 of Chapter 6. Note that the normalization layer (layer 3)
in Figure 12.1 is missing from Figure 14.11. This is attributable to the following
theorem of implicit weight normalization [2].

Figure 14.10. Input-output behaviors of decision trees with terminal nodes
characterized by (a) constants and (b) linear equations. (MATLAB files: ftsurf1.m
and ftsurf2.m)


[Figure 14.11 layer labels: Layer 1 (MFs); Layer 2 (Firing Strengths); Layer 3 (Rule Outputs)]

Figure 14.11. ANFIS architecture corresponding to the fuzzy version of the rule
set in Equation (14.7).

Theorem 14.1 Implicit weight normalization in a CART-constructed ANFIS network

In converting a decision tree to a fuzzy inference system, if (1) $\mu_{x \ge a}(x) + \mu_{x < a}(x) = 1$,
where $x$ is any of the input variables and $a$ is any of the splits of $x$, and (2)
multiplication is used as the T-norm operator to calculate each rule’s firing strength,
then the summation over each rule’s firing strength is always equal to unity.

Proof: This theorem can be proved by induction. Let $n$ be the number of rules
and $w_i$, $i = 1, \ldots, n$, be the firing strength of the $i$th rule. For $n = 2$, we have
$w_1 + w_2 = 1$ since $w_1$ and $w_2$ are the membership grades $\mu_{x < a}(x)$ and $\mu_{x \ge a}(x)$
for a certain input $x$ and a certain split value $a$.

Suppose that $\sum_{i=1}^{n} w_i = 1$ holds when $n = k$. When $n = k + 1$, we need to show
that $\sum_{i=1}^{n} w_i = 1$ still holds. Without loss of generality, we can assume the newly
generated rules are $k$ and $k + 1$, the result from splitting the previously terminal
node $k$ (or rule $k$). Consequently, we have

$$\begin{aligned}
\sum_{i=1}^{k+1} w_i &= \sum_{i=1}^{k-1} w_i + w_k + w_{k+1} \\
&= \sum_{i=1}^{k-1} w_i + \hat{w}_k \left( \mu_{x < a}(x) + \mu_{x \ge a}(x) \right) \\
&= \sum_{i=1}^{k-1} w_i + \hat{w}_k \\
&= 1,
\end{aligned}$$

where $\hat{w}_k$ is the firing strength of rule $k$ before splitting. This concludes the proof.

□ 
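The theorem is easy to check numerically for the four-rule tree of Equation (14.7). The sketch below is an illustration only: it assumes sigmoidal MFs for the "greater than or equal" side of each split, takes the complement for the "less than" side (playing the role of the INV nodes), and uses hypothetical split values $a$, $b$, $c$ and slope.

```python
import math


def mu_ge(v: float, split: float, alpha: float = 4.0) -> float:
    """Sigmoidal membership grade of 'v >= split'."""
    return 1.0 / (1.0 + math.exp(-alpha * (v - split)))


def mu_lt(v: float, split: float, alpha: float = 4.0) -> float:
    """Grade of 'v < split' as the complement, so condition (1) holds by construction."""
    return 1.0 - mu_ge(v, split, alpha)


def firing_strengths(x: float, y: float, a: float = 0.0, b: float = -1.0, c: float = 1.0):
    """Product T-norm firing strengths of the four fuzzified rules.

    The split values a, b, c are hypothetical.
    """
    return [
        mu_lt(x, a) * mu_lt(y, b),
        mu_lt(x, a) * mu_ge(y, b),
        mu_ge(x, a) * mu_lt(y, c),
        mu_ge(x, a) * mu_ge(y, c),
    ]


for x, y in [(-2.0, 0.5), (0.1, 0.9), (3.0, -3.0)]:
    w = firing_strengths(x, y)
    print([round(wi, 4) for wi in w], "sum =", round(sum(w), 12))
```

Up to floating-point round-off, the firing strengths sum to one for every input, so no separate normalization step is needed before weighting the rule outputs.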

The implicit weight normalization of the ANFIS architecture in Figure 14.11
is maintained throughout training processes; this eliminates the need for another
normalization layer and reduces training and application computation time as well
as round-off errors.

In summary, fuzzy modeling based on the CART-ANFIS approach consists of 
two tasks: 

Structure identification This is done by CART to find an initial set of crisp 
rules. 

Parameter identification After fuzzifying the premise parts of the initial rules, 
we can construct an ANFIS architecture to fine-tune the parameters. 

The major advantage offered by this approach is that we can quickly determine
the roughly correct structure of a fuzzy inference system using CART, and then
refine the MFs and output functions via an efficient ANFIS architecture that does not
need a normalization layer. Note that CART can select relevant inputs and do tree
partitioning of the input space, while ANFIS refines the regression and makes it
everywhere continuous and smooth. Thus it can be seen that CART and ANFIS
are complementary and their combination constitutes a solid approach to fuzzy
modeling.

14.5 SUMMARY 

This chapter presents the CART (classification and regression tree) algorithm, a fast 
one-pass approach for multivariate analysis. CART is a popular non-parametric 
approach for both data classification and regression in statistics, so we devote a 
whole chapter to it. Because CART can select relevant input variables and partition 
the input space effectively, it is an ideal tool for structure identification in ANFIS 
(Chapter 12). 


EXERCISES 

1. Prove that in a complete binary tree where each node has zero or two children, 
there is always one more terminal node than there are internal nodes. 

2. Prove that the entropy function in Equation (14.2) reaches a maximum when
$p_1 = p_2 = \cdots = p_J = 1/J$.

3. Prove that the Gini index in Equation (14.2) reaches a maximum when
$p_1 = p_2 = \cdots = p_J = 1/J$.

4. A new impurity function can be defined by

$$\phi(p_1, \ldots, p_J) = 1 - \max\{p_1, \ldots, p_J\}.$$

Prove that this function satisfies all the requirements in Equation (14.1) for
impurity functions.

5. For the preceding impurity function, write a MATLAB script to plot $\phi(p_1, 1 - p_1)$
as a curve and $\phi(p_1, p_2, 1 - p_1 - p_2)$ as a two-dimensional surface.

6. Explain why a categorical variable with $k$ possible labels has
$(2^k - 2)/2 = 2^{k-1} - 1$ candidate splits.

7. Why is the $\Delta E(s, t)$ for regression trees [Equation (14.5)] different from that for
classification trees [Equation (14.3)]? Suppose that we redefine the regression
tree error measure in Equation (14.4) as the mean-squared error. How should you
modify $\Delta E(s, t)$ in Equation (14.5) accordingly?

8. Express $E(t)$ in Equation (14.4) as an explicit function of $\mathbf{x}_i$ and $y_i$, assuming
that the local model for node $t$ is a modifiable constant $\boldsymbol{\theta} = \theta$; that is,
$d_t(\mathbf{x}, \boldsymbol{\theta}) = \theta$.

9. Repeat Exercise 8, but assume that the local model for node $t$ is a linear
model; that is, $d_t(\mathbf{x}, \boldsymbol{\theta}) = \theta_0 + x_1\theta_1 + x_2\theta_2 + \cdots + x_n\theta_n$, where
$\mathbf{x} = [x_1 \cdots x_n]^T$ and $\boldsymbol{\theta} = [\theta_0\ \theta_1 \cdots \theta_n]^T$.

10. If $d_t(\mathbf{x}, \boldsymbol{\theta}) = \theta$ is a modifiable constant, prove that $\Delta E(s, t)$ in Equation (14.5)
is always greater than or equal to zero. When is it equal to zero?

11. Repeat Example 14.5, but change the local model of each terminal node to a
linear model. Plot the input-output surfaces and compare them with Figures 14.6
and 14.7.


Data Clustering Algorithms


J.-S. R. Jang 


15.1 INTRODUCTION 

Clustering algorithms are used extensively not only to organize and categorize data,
but also for data compression and model construction. Chapter 11 describes some of
the on-line clustering algorithms that can be realized by unsupervised learning
neural networks. This chapter introduces four of the most representative off-line
clustering techniques frequently used in conjunction with radial basis function
networks and fuzzy modeling: (hard) C-means (or K-means) clustering, fuzzy C-means
clustering, the mountain clustering method, and subtractive clustering.

Clustering partitions a data set into several groups such that the similarity
within a group is larger than that among groups. Achieving such a partitioning
requires a similarity metric that takes two input vectors and returns a value
reflecting their similarity. Since most similarity metrics are sensitive to the ranges
of elements in the input vectors, each of the input variables must be normalized to
within, say, the unit interval [0, 1]. Hence, the rest of this chapter assumes that the
data set under consideration has already been normalized to be within the unit
hypercube.

Clustering techniques are used in conjunction with radial basis function networks 
or fuzzy modeling primarily to determine initial locations for radial basis functions 
or fuzzy if-then rules. For this purpose, clustering techniques are validated on the 
basis of the following assumptions: 

1. Similar inputs to the target system to be modeled should produce similar 
outputs. 

2. These similar input-output pairs are bundled into clusters in the training data 
set. 

Assumption 1 states that the target system to be modeled is a smooth input-output
mapping; this is generally true for real-world systems. Assumption 2 requires
the data set to conform to some specific type of distribution; however, this is not
always true. Therefore, clustering techniques used for structure identification in
neural or fuzzy modeling are highly heuristic, and finding a data set to which
clustering techniques cannot be applied satisfactorily is not uncommon.


15.2 K-MEANS CLUSTERING 

The K-means clustering [6, 8], also known as C-means clustering, has been 
applied to a variety of areas, including image and speech data compression [3, 7], 
data preprocessing for system modeling using radial basis function networks [9], and 
task decomposition in heterogeneous neural network architectures [3]. 

The K-means algorithm partitions a collection of $n$ vectors $\mathbf{x}_j$, $j = 1, \ldots, n$,
into $c$ groups $G_i$, $i = 1, \ldots, c$, and finds a cluster center in each group such that
a cost function (or an objective function) of dissimilarity (or distance) measure is
minimized. When the Euclidean distance is chosen as the dissimilarity measure
between a vector $\mathbf{x}_k$ in group $i$ and the corresponding cluster center $\mathbf{c}_i$, the cost
function can be defined by

$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \left( \sum_{k, \mathbf{x}_k \in G_i} \|\mathbf{x}_k - \mathbf{c}_i\|^2 \right), \tag{15.1}$$

where $J_i = \sum_{k, \mathbf{x}_k \in G_i} \|\mathbf{x}_k - \mathbf{c}_i\|^2$ is the cost function within group $i$. Thus, the
value of $J_i$ depends on the geometrical properties of $G_i$ and the location of $\mathbf{c}_i$.

In general, a generic distance function $d(\mathbf{x}_k, \mathbf{c}_i)$ can be applied for vector $\mathbf{x}_k$ in
group $i$; the corresponding overall cost function is thus expressed as

$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \left( \sum_{k, \mathbf{x}_k \in G_i} d(\mathbf{x}_k, \mathbf{c}_i) \right). \tag{15.2}$$

For simplicity, the Euclidean distance is used as the dissimilarity measure and the
overall cost function is expressed as in Equation (15.1).

The partitioned groups are typically defined by a $c \times n$ binary membership
matrix $U$, where the element $u_{ij}$ is 1 if the $j$th data point $\mathbf{x}_j$ belongs to group
$i$, and 0 otherwise. Once the cluster centers $\mathbf{c}_i$ are fixed, the minimizing $u_{ij}$ for
Equation (15.1) can be derived as follows:

$$u_{ij} = \begin{cases}
1 & \text{if } \|\mathbf{x}_j - \mathbf{c}_i\|^2 \le \|\mathbf{x}_j - \mathbf{c}_k\|^2 \text{ for each } k \ne i, \\
0 & \text{otherwise.}
\end{cases} \tag{15.3}$$

Restated, $\mathbf{x}_j$ belongs to group $i$ if $\mathbf{c}_i$ is the closest center among all centers. Since a
given data point can only be in one group, the membership matrix $U$ has the following
properties:

$$\sum_{i=1}^{c} u_{ij} = 1, \quad \forall j = 1, \ldots, n,$$

and

$$\sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} = n.$$

On the other hand, if $u_{ij}$ is fixed, then the optimal center $\mathbf{c}_i$ that minimizes
Equation (15.1) is the mean of all vectors in group $i$:

$$\mathbf{c}_i = \frac{1}{|G_i|} \sum_{k, \mathbf{x}_k \in G_i} \mathbf{x}_k, \tag{15.4}$$

where $|G_i|$ is the size of $G_i$, or $|G_i| = \sum_{j=1}^{n} u_{ij}$.

For a batch-mode operation, the K-means algorithm is presented with a data set
$\mathbf{x}_i$, $i = 1, \ldots, n$; the algorithm determines the cluster centers $\mathbf{c}_i$ and the membership
matrix $U$ iteratively using the following steps:

Step 1: Initialize the cluster centers $\mathbf{c}_i$, $i = 1, \ldots, c$. This is typically achieved by
randomly selecting $c$ points from among all of the data points.

Step 2: Determine the membership matrix $U$ by Equation (15.3).

Step 3: Compute the cost function according to Equation (15.1). Stop if either it
is below a certain tolerance value or its improvement over the previous iteration
is below a certain threshold.

Step 4: Update the cluster centers according to Equation (15.4). Go to step 2.

The algorithm is inherently iterative, and no guarantee can be made that it 
will converge to an optimum solution. The performance of the K-means algorithm 
depends on the initial positions of the cluster centers, thereby making it advisable 
either to employ some front-end methods to find good initial cluster centers or to 
run the algorithm several times, each with a different set of initial cluster centers. 
Moreover, the preceding algorithm is only a representative one; it is also possible to 
initialize a random membership matrix first and then follow the iterative procedure. 
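The four steps of the batch-mode algorithm can be sketched as follows. This is a minimal Python illustration, not a reference implementation: the function names are hypothetical, the initial centers are passed in by the caller instead of being drawn at random, only the improvement-based stopping test of step 3 is implemented, and an empty group simply keeps its old center (an assumption the text does not discuss).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(data: List[Point], init_centers: List[Point],
           threshold: float = 1e-9, max_iter: int = 100):
    """Batch K-means; returns (centers, labels, cost)."""
    centers = [tuple(c) for c in init_centers]            # step 1 (given by the caller)
    prev_cost = float("inf")
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center (Equation (15.3)).
        labels = [min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
                  for x in data]
        # Step 3: cost of Equation (15.1); stop when the improvement is small.
        cost = sum(sq_dist(x, centers[l]) for x, l in zip(data, labels))
        if prev_cost - cost < threshold:
            break
        prev_cost = cost
        # Step 4: move each center to the mean of its group (Equation (15.4)).
        for i in range(len(centers)):
            group = [x for x, l in zip(data, labels) if l == i]
            if group:                                      # empty group: keep old center
                centers[i] = tuple(sum(col) / len(group) for col in zip(*group))
    return centers, labels, cost


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels, cost = kmeans(data, init_centers=[data[0], data[3]])
print(centers)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(cost)
```

Because the result depends on the starting centers, a practical run would call `kmeans` several times with different `init_centers` and keep the result with the lowest cost, as suggested above.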

The K-means algorithm can also be operated in the on-line mode, where the
cluster centers and the corresponding groups are derived through time averaging.
That is, for a given data point $\mathbf{x}$, the algorithm finds the closest cluster center $\mathbf{c}_i$
and updates it using the formula

$$\Delta \mathbf{c}_i = \eta (\mathbf{x} - \mathbf{c}_i).$$

This on-line formula is essentially embedded in many learning rules of the unsuper- 
vised learning neural networks introduced in Chapter 11. 
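A single on-line update can be sketched as below. The function name is hypothetical, and $\eta$ is treated here as a small positive step size (learning rate), which is an assumption since the text does not define it.

```python
from typing import List, Sequence


def online_kmeans_step(centers: List[List[float]], x: Sequence[float],
                       eta: float = 0.1) -> int:
    """Move the closest center toward x by eta * (x - c_i); return its index i."""
    i = min(range(len(centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centers[k])))
    centers[i] = [c + eta * (a - c) for a, c in zip(x, centers[i])]
    return i


centers = [[0.0, 0.0], [10.0, 10.0]]
for point in [(1.0, 1.0), (9.0, 11.0), (0.0, 2.0)]:
    online_kmeans_step(centers, point, eta=0.5)
print(centers)  # [[0.25, 1.25], [9.5, 10.5]]
```

Only the winning center moves at each presentation, which is the same winner-take-all pattern used by competitive learning rules.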


15.3 FUZZY C-MEANS CLUSTERING 

Fuzzy C-means clustering (FCM), also known as fuzzy ISODATA, is a data
clustering algorithm in which each data point belongs to a cluster to a degree
specified by a membership grade. Bezdek proposed this algorithm in 1973 [1] as an
improvement over earlier hard C-means (HCM) clustering described in the previous
section.

FCM partitions a collection of $n$ vectors $\mathbf{x}_i$, $i = 1, \ldots, n$, into $c$ fuzzy groups,
and finds a cluster center in each group such that a cost function of dissimilarity
measure is minimized. The major difference between FCM and HCM is that FCM
employs fuzzy partitioning such that a given data point can belong to several groups
with the degree of belongingness specified by membership grades between 0 and 1.
To accommodate the introduction of fuzzy partitioning, the membership matrix
$U$ is allowed to have elements with values between 0 and 1. However, imposing
normalization stipulates that the summation of degrees of belongingness for a data
point always be equal to unity:

$$\sum_{i=1}^{c} u_{ij} = 1, \quad \forall j = 1, \ldots, n. \tag{15.5}$$

The cost function (or objective function) for FCM is then a generalization of
Equation (15.1):

$$J(U, \mathbf{c}_1, \ldots, \mathbf{c}_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m d_{ij}^2, \tag{15.6}$$

where $u_{ij}$ is between 0 and 1; $\mathbf{c}_i$ is the cluster center of fuzzy group $i$; $d_{ij} = \|\mathbf{c}_i - \mathbf{x}_j\|$
is the Euclidean distance between the $i$th cluster center and the $j$th data point; and
$m \in [1, \infty)$ is a weighting exponent.

The necessary conditions for Equation (15.6) to reach a minimum can be found
by forming a new objective function $\bar{J}$ as follows:

$$\begin{aligned}
\bar{J}(U, \mathbf{c}_1, \ldots, \mathbf{c}_c, \lambda_1, \ldots, \lambda_n)
&= J(U, \mathbf{c}_1, \ldots, \mathbf{c}_c) + \sum_{j=1}^{n} \lambda_j \left( \sum_{i=1}^{c} u_{ij} - 1 \right) \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m d_{ij}^2 + \sum_{j=1}^{n} \lambda_j \left( \sum_{i=1}^{c} u_{ij} - 1 \right),
\end{aligned} \tag{15.7}$$

where $\lambda_j$, $j = 1$ to $n$, are the Lagrange multipliers for the $n$ constraints in
Equation (15.5). By differentiating $\bar{J}(U, \mathbf{c}_1, \ldots, \mathbf{c}_c, \lambda_1, \ldots, \lambda_n)$ with respect to all its
input arguments, the necessary conditions for Equation (15.6) to reach its minimum
are

$$\mathbf{c}_i = \frac{\sum_{j=1}^{n} u_{ij}^m \mathbf{x}_j}{\sum_{j=1}^{n} u_{ij}^m}, \tag{15.8}$$

and

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{2/(m-1)}}. \tag{15.9}$$

The proof of these two necessary conditions is left as Exercise 1 at the end of this
chapter. The fuzzy C-means algorithm is simply an iterated procedure through the
preceding two necessary conditions. In a batch-mode operation, FCM determines the
cluster centers $\mathbf{c}_i$ and the membership matrix $U$ using the following steps [1]:

Step 1: Initialize the membership matrix $U$ with random values between 0 and 1
such that the constraints in Equation (15.5) are satisfied.

Step 2: Calculate $c$ fuzzy cluster centers $\mathbf{c}_i$, $i = 1, \ldots, c$, using Equation (15.8).

Step 3: Compute the cost function according to Equation (15.6). Stop if either it
is below a certain tolerance value or its improvement over the previous iteration
is below a certain threshold.

Step 4: Compute a new $U$ using Equation (15.9). Go to step 2.

The cluster centers can also be first initialized and then the iterative procedure
carried out. There is no guarantee that FCM converges to an optimum solution.
The performance depends on the initial cluster centers, so it is advisable either
to use another fast algorithm to determine the initial cluster centers or to run FCM
several times, each starting with a different set of initial cluster centers.
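The FCM iteration can be sketched in plain Python as follows. Treat it as an illustration under assumptions rather than the toolbox implementation: the function name, the fixed random seed, and the stopping threshold are hypothetical; $m$ must be greater than 1 for the membership update to be defined; and a data point that coincides with a center is given full membership in that center, a special case the text does not spell out.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def fcm(data: List[Point], c: int, m: float = 2.0,
        threshold: float = 1e-9, max_iter: int = 200, seed: int = 0):
    """Fuzzy C-means; returns (centers, U, cost) with U[i][j] = u_ij."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Step 1: random memberships, each column rescaled to sum to one.
    U = [[rng.random() + 1e-3 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= col
    prev_cost = float("inf")
    for _ in range(max_iter):
        # Step 2: centers as membership-weighted means (weights u_ij ** m).
        centers = []
        for i in range(c):
            wts = [U[i][j] ** m for j in range(n)]
            total = sum(wts)
            centers.append(tuple(sum(w * x[d] for w, x in zip(wts, data)) / total
                                 for d in range(dim)))
        D = [[math.dist(centers[i], data[j]) for j in range(n)] for i in range(c)]
        # Step 3: cost function; stop when it no longer changes noticeably.
        cost = sum(U[i][j] ** m * D[i][j] ** 2 for i in range(c) for j in range(n))
        if abs(prev_cost - cost) < threshold:
            break
        prev_cost = cost
        # Step 4: membership update from the distance ratios.
        for j in range(n):
            hits = [i for i in range(c) if D[i][j] == 0.0]
            for i in range(c):
                if hits:  # point sits exactly on a center (assumed handling)
                    U[i][j] = 1.0 / len(hits) if i in hits else 0.0
                else:
                    U[i][j] = 1.0 / sum((D[i][j] / D[k][j]) ** (2.0 / (m - 1.0))
                                        for k in range(c))
    return centers, U, cost


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, U, cost = fcm(data, c=2)
print(sorted(centers))
print([round(U[0][j] + U[1][j], 6) for j in range(len(data))])  # each column sums to 1
```

For well-separated data such as this toy set, the centers land close to the two group means, and the memberships of each point are close to, but not exactly, 0 and 1.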

Figure 15.1 presents a MATLAB demo of the fuzzy C-means clustering method in
the Fuzzy Logic Toolbox. The data set, number of clusters, weighting exponent, and
several stopping criteria can all be changed via the graphical user interface. Pushing
the “Start” button lets you observe how the cluster centers move toward the
“right” positions. In particular, if “Label Data” is checked, you will be able to see
how each group evolves when the cluster centers move. After the clustering process
stops, a cluster center can be selected, which will display the membership grades
of all data points toward the selected cluster center. Figure 15.2 illustrates the MF
plots with respect to three cluster centers. Recall that the membership grades are
only defined at the locations of data points; the surface plots in Figure 15.2 are
obtained via 2-D interpolation using the MATLAB command griddata.

Bezdek’s monograph [2] provides a detailed treatment of fuzzy C-means clus- 
tering, including its variants and convergence properties. Applications of fuzzy 
C-means include medical image segmentation [5] and qualitative modeling [10]. 

15.4 MOUNTAIN CLUSTERING METHOD 

The mountain clustering method, as proposed by Yager and Filev [11, 13], 
is a relatively simple and effective approach to approximate estimation of cluster 
centers on the basis of a density measure called the mountain function. This 
method can be used to obtain initial cluster centers that are required by more 
sophisticated cluster algorithms, such as fuzzy C-means clustering introduced in the 
previous section. It can also be used as a quick stand-alone method for approximate 
clustering. The method is based on what a human does in visually forming clusters 
of a data set. 

Figure 15.1. Demo program for fuzzy C-means clustering. (MATLAB command:
fcmdemo)

Figure 15.2. MF plots for the demo of fuzzy C-means clustering in Figure 15.1.

The first step involves forming a grid on the data space, where the intersections
of the grid lines constitute the candidates for cluster centers, denoted as a set $V$.
A finer gridding increases the number of potential cluster centers, but it also
increases the computation required. The gridding is generally evenly spaced, but this
is not a requirement. We can have an unevenly spaced gridding to reflect a priori
knowledge of data distribution. Moreover, if the data set itself (instead of the grid
points) is used as the candidates for cluster centers, then we have a variant called
subtractive clustering, as discussed in the next section.

The second step entails constructing a mountain function representing a data
density measure. The height of the mountain function at a point $\mathbf{v} \in V$ is equal
to

$$m(\mathbf{v}) = \sum_{i=1}^{N} \exp\left( -\frac{\|\mathbf{v} - \mathbf{x}_i\|^2}{2\sigma^2} \right), \tag{15.10}$$

where $\mathbf{x}_i$ is the $i$th data point, $N$ is the number of data points, and $\sigma$ is an
application-specific constant. The preceding equation implies that each data point
$\mathbf{x}_i$ contributes to the height of the mountain function at $\mathbf{v}$, and the contribution
decreases with the distance between $\mathbf{x}_i$ and $\mathbf{v}$. The mountain function can be viewed
as a measure of data density since it tends to be higher if more data points are
located nearby, and lower if fewer data points are around. The constant $\sigma$ determines
the height as well as the smoothness of the resultant mountain function; this is
demonstrated in Example 15.1. The clustering results are normally insensitive to the
value of $\sigma$, as long as the data set is of sufficient size and is well clustered.

The third step involves selecting the cluster centers by sequentially destructing
the mountain function. We first find the point in the candidate centers $V$ that
has the greatest value for the mountain function; this becomes the first cluster
center $\mathbf{c}_1$. (In case of more than one maximum, one of them is randomly selected
as the first cluster center.) Obtaining the next cluster center requires eliminating
the effect of the just-identified center, which is typically surrounded by a number
of grid points that also have high density scores. This is realized by revising the
mountain function; a new mountain function is formed by subtracting a scaled
Gaussian function centered at $\mathbf{c}_1$:

$$m_{new}(\mathbf{v}) = m(\mathbf{v}) - m(\mathbf{c}_1) \exp\left( -\frac{\|\mathbf{v} - \mathbf{c}_1\|^2}{2\beta^2} \right). \tag{15.11}$$

The subtracted amount $m(\mathbf{c}_1) \exp\left( -\frac{\|\mathbf{v} - \mathbf{c}_1\|^2}{2\beta^2} \right)$ decreases with the
distance between $\mathbf{v}$ and the just-identified center $\mathbf{c}_1$, and it is proportional
to the height $m(\mathbf{c}_1)$ at the center. Note that after subtraction, the new mountain
function $m_{new}(\mathbf{v})$ reduces to zero at $\mathbf{v} = \mathbf{c}_1$.

After subtraction, the second cluster center is again selected as the point in $V$
that has the largest value for the new mountain function. This process of revising
the mountain function and finding the next cluster center continues until a sufficient
number of cluster centers is attained. The following example clarifies the concept.
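The three steps can be sketched in Python as follows. This is an illustration under assumptions: the function name, the evenly spaced grid on the unit hypercube, and the default values of `grid_lines`, `sigma`, and `beta` are all hypothetical choices, and ties between maxima are broken by taking the first grid point instead of a random one.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mountain_clustering(data: List[Point], n_centers: int, grid_lines: int = 11,
                        sigma: float = 0.1, beta: float = 0.1) -> List[Point]:
    """Mountain clustering on an evenly spaced grid over the unit hypercube."""
    dim = len(data[0])
    # Step 1: grid intersections are the candidate centers V.
    ticks = [k / (grid_lines - 1) for k in range(grid_lines)]
    grid = list(product(ticks, repeat=dim))
    # Step 2: mountain function m(v) at every grid point.
    m = [sum(math.exp(-sq_dist(v, x) / (2.0 * sigma ** 2)) for x in data) for v in grid]
    # Step 3: pick the highest point, subtract a scaled Gaussian, repeat.
    centers = []
    for _ in range(n_centers):
        best = max(range(len(grid)), key=lambda g: m[g])   # first maximum on ties
        center, height = grid[best], m[best]
        centers.append(center)
        m = [m[g] - height * math.exp(-sq_dist(grid[g], center) / (2.0 * beta ** 2))
             for g in range(len(grid))]
    return centers


data = [(0.2, 0.2), (0.22, 0.18), (0.18, 0.21),
        (0.8, 0.8), (0.79, 0.82), (0.81, 0.8), (0.8, 0.78)]
print(mountain_clustering(data, n_centers=2))
```

The grid has `grid_lines ** dim` points, so the cost of step 2 grows quickly with the input dimension; the toy data here are two-dimensional, giving 121 candidates.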


Example 15.1 Mountain clustering method for 2-D data 

Figure 15.3(a) displays a set of 2-D data, in which three clusters can be observed
effortlessly. However, for data sets of higher dimensions (e.g., more than three), no
effective visualization techniques are available to determine the clusters visually;
therefore clustering techniques described in this chapter must be relied on. In this
example, the mountain method is employed to find the cluster centers.

To demonstrate the effects of $\sigma$, Figures 15.3(b) through 15.3(d) are the surface
plots of the mountain functions with $\sigma$ equal to 0.02, 0.1, and 0.2, respectively.
Obviously $\sigma$ affects the mountain function’s height as well as its smoothness; therefore,
the value of $\sigma$ should be chosen cautiously considering both the data size and input
dimension.

Figure 15.3. Mountain construction: (a) 2-D data set; corresponding mountain
functions with $\sigma$ equal to (b) 0.02, (c) 0.1, and (d) 0.2. (MATLAB command: mount1)

Once $\sigma$ is determined (0.1 in this example) and the mountain function is
constructed, we begin to select clusters and revise the mountain function
sequentially. This is shown in Figures 15.4(a), (b), (c), and (d), with $\beta$ equal to 0.1 in
Equation (15.11).

Figure 15.4. Mountain destruction with $\beta = 0.1$: (a) the original mountain
function with $\sigma = 0.1$; (b) mountain function after the first reduction; (c) mountain
function after the second reduction; (d) mountain function after the third
reduction. (MATLAB command: mount2)


□ 

Yager and Filev [12, 13] also applied mountain clustering to the structure
identification of fuzzy modeling. They used a training data set (including inputs and
desired outputs) to find cluster centers $(\mathbf{x}_i^*, y_i^*)$ via mountain clustering first, and
then formed a zero-order Sugeno fuzzy model in which the $i$th rule is expressed
as

If $X$ is close to $\mathbf{x}_i^*$ then $Y$ is close to $y_i^*$.

Restated, the $i$th rule is based on the $i$th cluster center identified by the mountain
clustering method. After the structure is determined, backpropagation-type
gradient descent and other optimization schemes can be applied to proceed with
parameter identification. Several examples can be found in ref. [13].


15.5 SUBTRACTIVE CLUSTERING 


The mountain clustering method described in the previous section is relatively 
simple and effective. However, its computation grows exponentially with the dimension 
of the problem because the method must evaluate the mountain function over all 
grid points. For instance, a clustering problem with four variables and each 
dimension having a resolution of 10 grid lines would result in $10^4$ grid points that 
must be evaluated. An alternative approach is subtractive clustering, proposed by 
Chiu [4], in which data points (not grid points) are considered as the candidates for 
cluster centers. With this method, the computation depends on the number of data 
points (the density measure defined below is evaluated once per data point and sums 
over all data points) and does not grow exponentially with the dimension of the 
problem under consideration. 

Consider a collection of $n$ data points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ in an 
$M$-dimensional space. Without loss of generality, the data points are assumed to have 
been normalized within a hypercube. Since each data point is a candidate for cluster 
centers, a density measure at data point $\mathbf{x}_i$ is defined as 

$$D_i = \sum_{j=1}^{n} \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{(r_a/2)^2} \right),$$

where $r_a$ is a positive constant. Hence, a data point will have a high density value 
if it has many neighboring data points. The radius $r_a$ defines a neighborhood; data 
points outside this radius contribute only slightly to the density measure. 

After the density measure of each data point has been calculated, the data point 
with the highest density measure is selected as the first cluster center. Let 
$\mathbf{x}_{c_1}$ be the point selected and $D_{c_1}$ its density measure. Next, the density 
measure for each data point $\mathbf{x}_i$ is revised by the formula 

$$D_i = D_i - D_{c_1} \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_{c_1}\|^2}{(r_b/2)^2} \right),$$

where $r_b$ is a positive constant. Therefore, the data points near the first cluster 
center $\mathbf{x}_{c_1}$ will have significantly reduced density measures, thereby making 
the points unlikely to be selected as the next cluster center. The constant $r_b$ 
defines a neighborhood that has measurable reductions in density measure. The 
constant $r_b$ is normally larger than $r_a$ to prevent closely spaced cluster centers; 
generally $r_b$ is equal to $1.5 r_a$, as suggested in ref. [4]. 

After the density measure for each data point is revised, the next cluster center 
$\mathbf{x}_{c_2}$ is selected and all of the density measures for data points are revised again. 
This process is repeated until a sufficient number of cluster centers are generated. 
A more sophisticated stopping criterion for automatically determining the number 
of clusters can be found in ref. [4]. 
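
The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `subtractive_clustering`, its arguments, and the fixed number of centers used as the stopping rule are hypothetical choices, not the stopping criterion of ref. [4]; the data are assumed to be already normalized.

```python
import math

def subtractive_clustering(points, r_a=0.5, r_b=None, n_centers=2):
    """Pick n_centers cluster centers among the (normalized) data points."""
    if r_b is None:
        r_b = 1.5 * r_a  # ratio mentioned in the text
    alpha = 1.0 / (r_a / 2.0) ** 2
    beta = 1.0 / (r_b / 2.0) ** 2

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Density measure of every data point (pairwise sums over all points).
    density = [sum(math.exp(-alpha * sq_dist(p, q)) for q in points)
               for p in points]

    centers = []
    for _ in range(min(n_centers, len(points))):
        # The point with the highest density becomes the next center.
        c = max(range(len(points)), key=density.__getitem__)
        d_c = density[c]
        centers.append(points[c])
        # Subtract the influence of the new center from every density value.
        density = [d - d_c * math.exp(-beta * sq_dist(p, points[c]))
                   for d, p in zip(density, points)]
    return centers
```

For two well-separated groups of points, the first center returned should fall in the more populated group and the second in the other one, since the revision step suppresses the densities around the first center.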
When applying subtractive clustering to a set of input-output data, each of 
the cluster centers represents a prototype that exhibits certain characteristics of 
the system to be modeled. These cluster centers can reasonably be used as the 
centers for the fuzzy rules' premise in a zero-order Sugeno fuzzy model, or radial 
basis functions in an RBFN. For instance, assume that the center for the $i$th cluster 
is $\mathbf{c}_i$ in an $M$-dimensional space. Then $\mathbf{c}_i$ can be decomposed into two 
component vectors $\mathbf{p}_i$ and $\mathbf{q}_i$, where $\mathbf{p}_i$ is the input part and 
contains the first $N$ elements of $\mathbf{c}_i$; $\mathbf{q}_i$ is the output part and contains 
the last $M - N$ elements of $\mathbf{c}_i$. Then, given an input vector $\mathbf{x}$, the degree 
to which fuzzy rule $i$ is fulfilled is defined by 

$$\mu_i = \exp\left( -\frac{\|\mathbf{x} - \mathbf{p}_i\|^2}{(r_a/2)^2} \right).$$

This is also the definition of the $i$th radial basis function if we adopt the 
perspective of modeling using RBFNs. Once the premise part (or the radial basis 
functions) has been determined, the consequent part (or the weights for the output 
unit in an RBFN) can be estimated by the least-squares method. After these procedures 
are completed, more accuracy can be gained by using gradient descent or other advanced 
derivative-based optimization schemes (Chapter 6) for further refinement. 
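
As a rough illustration of how the cluster centers might be turned into a model, the sketch below computes the degree of fulfillment of each rule and a zero-order Sugeno output. It makes several assumptions that are not in the text: a single output variable, the output part of each center used directly as the initial consequent (instead of a least-squares estimate), and the hypothetical function names `firing_strengths` and `sugeno_output`.

```python
import math

def firing_strengths(x, centers, n_inputs, r_a=0.5):
    """Degree to which each rule is fulfilled by the input vector x."""
    strengths = []
    for c in centers:
        p = c[:n_inputs]  # input part of the cluster center
        sq = sum((a - b) ** 2 for a, b in zip(x, p))
        strengths.append(math.exp(-sq / (r_a / 2.0) ** 2))
    return strengths

def sugeno_output(x, centers, n_inputs, r_a=0.5):
    """Weighted average of the consequents (assumed: one output per center)."""
    w = firing_strengths(x, centers, n_inputs, r_a)
    q = [c[n_inputs] for c in centers]  # output part, used as initial consequent
    # Note: sum(w) can underflow to zero for inputs far from every center.
    return sum(wi * qi for wi, qi in zip(w, q)) / sum(w)
```

An input that coincides with the input part of a center fires that rule with strength 1, and an input halfway between two centers yields the average of their consequents.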


15.6 SUMMARY 

This chapter presents four of the most representative off-line clustering techniques 
frequently used in conjunction with radial basis function networks and fuzzy model- 
ing: (hard) C-means clustering, fuzzy C-means clustering, the mountain clustering 
method, and subtractive clustering. These clustering techniques provide batch- 
mode approaches to finding prototypes characterizing a data set; these prototypes 
are then used as the centers for radial basis functions in RBFNs (Chapter 9) or 
fuzzy rules in ANFIS (Chapter 12). For data compression, these prototypes are 
used as a codebook in vector quantization. 

Some on-line clustering algorithms implemented by unsupervised learning neural 
networks are explained in Chapter 11. 


EXERCISE 

1. Differentiate Equation (15.7) to obtain Equations (15.8) and (15.9). 


Chapter 16 


Rulebase Structure 
Identification 


C.-T. Sun 


16.1 INTRODUCTION 

We can perform fuzzy modeling by extracting knowledge from human experts and 
by transforming the expertise into rules and membership functions. However, 
depending on human introspection and experience leads to some difficulties. First, 
human knowledge is often incomplete and episodic rather than systematic. Moreover, 
there is no formal and effective way of knowledge acquisition. As a result, 
researchers have been trying to automate the neuro-fuzzy modeling process based 
on numerical training data. In general, neuro-fuzzy modeling is a branch of system 
identification (Chapter 5) and it also involves two phases: structure identification 
and parameter identification. The former is related to finding a suitable 
number of rules and a proper partition of the feature space. The latter is concerned 
with the adjustment of system parameters, such as the membership functions, linear 
coefficients, and so on. 

The problems related to parameter identification in fuzzy modeling are cov- 
ered in Chapters 5, 6, 12, and 13. For structure identification in fuzzy modeling, 
Chapters 14 and 15 give several heuristic but practical and systematic approaches. 
However, the problem of structure identification in fuzzy modeling is by no means 
solved; many practical problems remain to be addressed. This chapter 
raises these problems and suggests potential solutions from a high-level point of view. 

As mentioned in Chapter 4, if we do not use any structure identification tech- 
niques in fuzzy modeling, we have to accept the simple grid partitioning of the 
input space, as shown in Figure 4.13(a) on page 87. However, this leads to the “curse 
of dimensionality” when the number of input variables becomes large. One way to 
alleviate this problem is to employ input selection schemes to choose relevant inputs 
for the modeling task. A simple input selection scheme that can be embedded in 
the parameter identification phase is described in Section 16.2. 




Another way to relieve the problem of an exponentially growing rulebase is to 
have a sophisticated partitioning of the input space. A hill-climbing method based 
on k-d trees is described in Section 16.3 to implement a tree partitioning of the 
input space. Two objective functions, a density measure and a typicality measure, 
are used in the partition process to find a proper starting point for the parameter 
identification phase. 

However, when we face a complex system, a simple, global, and effective 
partition is difficult to find. This is the dilemma between modeling accuracy and 
learning/operation efficiency. A method of rule organization to solve this problem 
is discussed in Section 16.4. It employs the concept of divide-and-conquer to achieve 
modeling accuracy with many small rules, each covering a small local region. 
Then we build a binary fuzzy boxtree out of the rules based on a similarity 
measure between their antecedent patterns. A branch-and-bound algorithm can do 
the pattern matching job for firing appropriate rules with logarithmic efficiency, so 
that the large number of rules will cause no trouble for the entire system. Moreover, 
to maintain fuzzy rulebased systems' advantage in parallel processing, a parallel 
algorithm is proposed to meet the requirement. 

Another alternative to cope with sophisticated systems is to use rule combination 
so that we can apply a simplified rulebase. This is applicable whenever there is a 
focus area in the application domain. The algorithm for rule combination with 
respect to a boxtree is given in Section 16.5. 

The ANFIS architecture discussed in Chapter 12 is general, and it can incor- 
porate advanced fuzzy pattern-matching techniques such as weights of importance 
and fuzzy quantifiers in a fuzzy reasoning process. Further discussion on structure 
identification is based on the ANFIS model. 

16.2 INPUT SELECTION 

Up to now, the discussion of neuro-fuzzy modeling has been based on the implicit 
assumption that the input variables are of equal importance. However, in applications 
of pattern recognition, time series prediction, or multi-criteria decision making, 
this assumption is usually not true. In other words, the first step in a general 
modeling scheme should be input or feature selection, which can identify a subset of 
all possible inputs as the actual inputs for ANFIS/CANFIS modeling. These identified 
inputs should possess more discriminating power to produce better regression 
or classification results. 

In this section, we shall apply the concept of weight of importance in fuzzy 
pattern matching to proceed with our input selection scheme. Due to the flexibility 
of adaptive networks (Chapter 8), the input selection scheme can be embedded 
into the ANFIS architecture, so we can perform parameter identification and input 
selection simultaneously. 

Assume that an input variable $x$ is associated with an importance measure 
$\sigma \in [0, 1]$. If a premise construct “$x$ is $A$” is used in a fuzzy rule with an 
AND-connected premise (IF) part, then the MF grade rescaled by the input importance 
measure $\sigma$ is denoted by 

$$s = (1 - \sigma) * 1 + \sigma * \mu_A(x) = 1 - \sigma * [1 - \mu_A(x)].$$

The preceding equation represents a linear interpolation between 1 and $\mu_A(x)$. If 
the input variable $x$ is of full importance ($\sigma = 1$), $s$ reduces to $\mu_A(x)$. On the 
other hand, if $x$ is of no importance ($\sigma = 0$), then $s$ is 1 and it plays no part in 
computing the firing strength of an AND rule, that is, a rule with an AND-connected 
premise (IF) part. 

Similarly, to incorporate the input importance measure into an OR rule, or fuzzy 
rule with an OR-connected premise (IF) part, we can define a new MF grade: 

$$s = (1 - \sigma) * 0 + \sigma * \mu_A(x) = \sigma * \mu_A(x).$$

This is a linear interpolation between 0 and $\mu_A(x)$. If $\sigma$ is close to 1, then $s$ 
is close to $\mu_A(x)$ and assumes its regular role. If $\sigma$ is close to 0, then $s$ is 
close to 0 and it plays little part in computing the firing strength. 
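
The two rescaling formulas are simple enough to check numerically. The following sketch assumes hypothetical helper names (`and_grade`, `or_grade`, `and_firing_strength`) and, as a further assumption, uses the algebraic product as the AND operator when combining grades into a firing strength.

```python
def and_grade(mu, sigma):
    """Rescaled MF grade for an AND-connected premise: 1 - sigma * (1 - mu)."""
    return 1.0 - sigma * (1.0 - mu)

def or_grade(mu, sigma):
    """Rescaled MF grade for an OR-connected premise: sigma * mu."""
    return sigma * mu

def and_firing_strength(grades, sigmas):
    """Firing strength of an AND rule (product T-norm is an assumption here)."""
    w = 1.0
    for mu, sigma in zip(grades, sigmas):
        w *= and_grade(mu, sigma)
    return w
```

With an importance of 0, an input contributes a factor of 1 to the AND rule and therefore drops out of the product; with an importance of 1, its ordinary MF grade is used unchanged.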

The preceding two equations are based on the concept of weights of importance; 
a detailed discussion can be found in ref. [2]. 

Through the concept of parameter sharing, discussed in Section 8.2, it is rather 
easy to incorporate the input importance measure $\sigma$, as an extra parameter 
associated with the input $x$, into the ANFIS architecture in Figure 12.1 on page 336. The 
configuration of the ANFIS architecture with an input importance measure is left 
as Exercise 1. 

The initial value of an input importance measure defaults to 1.0 but can 
be assigned any value in (0, 1] by users based on their heuristic judgment. 
During the training process, once an input importance measure has stabilized below a 
certain threshold value, the corresponding variable is considered unimportant in 
the system to be modeled. Thus, the variable can be neglected and a simplified 
structure/parameter identification process can be resumed to find an even better 
solution. 

16.3 INPUT SPACE PARTITIONING 

As mentioned in Section 4.5.1 of Chapter 4, the premise part of a fuzzy inference 
system implements a fuzzy partition in the multidimensional input (feature) space. 
In Figure 4.13, we have seen three partition schemes frequently used in modeling 
a multidimensional system. A recapitulation plus some additions are shown in 
Figure 16.1. 

Figure 16.1(a) is a common grid partitioning in which no adaptation 
method is used to change the premise part of a fuzzy inference system. An ANFIS 
model based on the grid partitioning forms the adaptive grid partitioning in 
Figure 16.1(b). In other words, at the beginning of training, a uniformly partitioned 
grid is taken as the initial state. As the parameters in the premise membership 
functions are adjusted, the grid evolves. The steepest descent method (or any other 
optimization technique described in Chapters 6 and 7) finds the optimal location 
and size of the fuzzy regions and the degree of overlapping among them. Two 
problems exist in this scheme. First, the number of linguistic terms for each input 
variable is predetermined and is highly heuristic. Second, the learning complexity 
suffers an exponential explosion as the number of inputs increases. 

Figure 16.1. Four fuzzy multidimensional data structures: (a) fuzzy grid, (b) 
adaptive fuzzy grid, (c) fuzzy k-d tree, (d) multi-level fuzzy grid. Shaded area 
represents overlapping among fuzzy regions. In two-dimensional space, the structure in 
(d) is also called a fuzzy quad tree. 

Grid partitioning gives the most restricted structures; at the other extreme, 
fuzzy clustering algorithms [1, 8] (see Section 15.3 of Chapter 15) based on 
training data result in the most flexible scatter partitioning shown in Figure 4.13(c) 
(page 87). However, this approach has its own problems. First, the resulting 
structures are not necessarily hyper-rectangles, so they need refinement. To map a 
fuzzy cluster to a set of bell-shaped functions, we can consider $c_i$ in Equation (2.23) 
(page 26) as the coordinate of the cluster center, $\mathbf{h}$, in the $i$th dimension; 
meanwhile, $a_i$ is assigned the value of the cluster radius, defined to be the longest 
distance from the center to a point, $p_i$, with nonzero membership in the $i$th 
dimension. The value of $b_i$ is interpreted as a slope and can be determined as a linear 
function of the membership of the boundary point $p_i$. 

The second weak point of clustering algorithms is that the cost tends to be high, 
because the efficiency of convergence is not guaranteed. Third, the total number of 
rules (clusters) is predetermined in these algorithms. When the resulting partition 
is not good enough and we want to increase the number of rules, we have to rerun 
the clustering algorithm from the beginning. 

The essential point here is this: in the context of adaptive network training, we 
do not need to find a perfect clustering, since our goal is just to find a satisfactory 
initial state for the adaptive network to tune. In other words, since we have verified 
the validity of the parameter identification mechanism using adaptive networks, it 
makes no sense to spend a lot of time optimizing the cluster criteria, no matter 
what the objective functions might be. 

Thus, we adopted an intermediately flexible partitioning, the fuzzy k-d trees 
[Figure 16.1(c), or Figure 4.13(b)] for structure identification. In the following, we 
explore several ways of providing an input space partition based on fuzzy k-d trees. 

A k-d tree results from a series of Guillotine cuts. By a Guillotine cut, we 
mean a cut which is made entirely across the subspace to be partitioned; each of 
the regions so produced can then be subjected to independent Guillotine cutting. 
At the beginning of the $i$th iteration step, the feature space is partitioned into $i$ 
regions. Now another Guillotine cut is applied to one of the regions to partition the 
entire space further into $i + 1$ regions. 

There are various strategies to decide which dimension to cut and where to cut it 
at each step; some of them are based merely on the distribution of training examples; 
others take the parameter identification methods into consideration. We list and 
briefly discuss several strategies in the following before introducing a hill-climbing 
method based on fuzzy clustering objective functions. 

Balanced-sampling criterion The simplest tactic is to cut the dimension in 
which the training data associated with the region are most spread out and 
to cut it at the median value of those samples in that dimension [3]. The 
expected shape of the regions under this procedure is asymptotically cubical 
because the long dimension is always cut. In general, this method produces 
homogeneously distributed localized receptive fields. 

Information gain The cutting procedure can be viewed as a method of building 
a decision tree (see Chapter 14). Quinlan [5] proposed a method based on 
information theory and defined the concept of information gain at each branch. 
Therefore, the rule of thumb is to choose a cut with the most information gain 
at each step. 

Regional linearity This strategy is suitable when the consequent part of a rule 
is represented as a linear combination of the input features, such as in the 
first-order ANFIS model. Conceptually, we want to use a hyperplane to 
approximate the training samples in a region and minimize the mean-square 
error. Thus, we can apply LSE (Chapter 5) to identify a set of linear 
coefficients, and the cut resulting in the least error will be selected under this 
criterion. 

Direct evaluation The most direct, but inefficient, way to evaluate a partition is 
to feed the resulting structure into the parameter identification phase and use 
the final performance to choose the best cut. Sugeno and Kang [10] used this 
approach together with lots of heuristics to find the proper structure. 
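
Of the strategies listed above, the balanced-sampling criterion is the easiest to make concrete. The sketch below is only illustrative: the function name `kd_partition` is hypothetical, and the rule for choosing which region to cut next (the one holding the most samples) is an assumption, since the text only specifies the dimension and the position of the cut.

```python
def kd_partition(samples, n_regions):
    """Split samples into groups by repeated median Guillotine cuts."""
    regions = [list(samples)]
    while len(regions) < n_regions:
        # Assumed selection rule: cut the region holding the most samples.
        idx = max(range(len(regions)), key=lambda i: len(regions[i]))
        region = regions.pop(idx)
        if len(region) < 2:
            regions.append(region)  # nothing left to split
            break

        def spread(d):
            values = [p[d] for p in region]
            return max(values) - min(values)

        # Cut the dimension in which the samples are most spread out ...
        d = max(range(len(region[0])), key=spread)
        # ... at the median of the samples along that dimension.
        ordered = sorted(region, key=lambda p: p[d])
        mid = len(ordered) // 2
        regions.extend([ordered[:mid], ordered[mid:]])
    return regions
```

Each returned group corresponds to one crisp region of the k-d tree; its bounding hyper-rectangle would still have to be fuzzified before being used as a rule premise.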

The preceding methods all produce crisp partitions, so the result needs to be fuzzified. 
Furthermore, the evaluation functions used in these hill-climbing algorithms are 
either too shallow, ignoring the needs of parameter identification, or too 
deep, and thus burdened by a massive amount of computation. A compromise 
is to use two fuzzy clustering objective functions, a typicality measure and 
a density measure. The basic assumption of this approach is simply that a good 
fuzzy rule is usually represented by a cluster that has a prototypical center and 
strong support from the samples. 

Our method is still an n-step hill-climbing approach, where n is the desired 
number of rules, because of the consideration of efficiency. At each step a fuzzy set 
is defined for each cluster i with the following membership function, which will be 
used in the objective functions: 


$$\mu_{ik} = \bigcap_{j=1}^{V} \frac{1}{1 + \left| \dfrac{x_{kj} - c_{ij}}{a_{ij}} \right|^{2 b_{ij}}}, \qquad (16.1)$$


where $\mu_{ik}$ denotes the membership of the $k$th point ($\mathbf{x}_k$) in the $i$th cluster, 
$V$ is the number of variables, $x_{kj}$ is the $j$th coordinate of $\mathbf{x}_k$, and $\bigcap$ 
stands for a fuzzy conjunctive operator. The calculation of the parameters $a_{ij}$, 
$b_{ij}$, and $c_{ij}$ depends on the hyper-rectangle defined by a cluster resulting from 
Guillotine cuts. Let $\mathbf{h}_i$ be the physical center of the $i$th hyper-rectangle; 
$c_{ij}$ is defined as the $j$th coordinate of $\mathbf{h}_i$; $a_{ij}$ is calculated as half of 
the length along the hyper-rectangle's $j$th dimension; and $b_{ij}$ is determined by the 
desired degree of overlapping between fuzzy regions. At the end of the structure 
identification phase, the $a$'s, $b$'s, and $c$'s are fed into the adaptive network as 
the initial parameters. 
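
A small sketch may help to see how the membership in Equation (16.1) follows from a hyper-rectangle. The function name `region_membership` is hypothetical, the algebraic product is assumed as the fuzzy conjunctive operator, and a single slope parameter `b` is used for all dimensions for brevity.

```python
def region_membership(x, lower, upper, b=2.0):
    """Membership of point x in the fuzzy region built on a hyper-rectangle.

    lower and upper hold the corner coordinates of the crisp region; every
    side is assumed to have nonzero length.
    """
    mu = 1.0
    for xj, lo, hi in zip(x, lower, upper):
        c = 0.5 * (lo + hi)  # coordinate of the physical center
        a = 0.5 * (hi - lo)  # half of the side length
        # Generalized bell grade; the product acts as the conjunction (assumed).
        mu *= 1.0 / (1.0 + abs((xj - c) / a) ** (2.0 * b))
    return mu
```

With these choices the membership is 1 at the center of the region and 0.5 on the crisp boundary along a single dimension, so neighboring regions overlap around their common cut; a larger `b` makes the overlap narrower.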

Now we define two objective functions for the best-first search process. As 
analyzed by Bezdek [1], various objective functions can suggest radically different 
substructures in the same data set. To achieve a meaningful structure for a fuzzy 
rulebase, we have to select appropriate measures. In our approach, we use two 
objective functions: one is a density measure ($J_d$), the other is a typicality measure 
($J_t$). 

Figure 16.2. A general fuzzy modeling scheme. 


$J_d$ was proposed by Ruspini [7]: 

where $P$ is the number of training data, $C$ is the number of clusters (rules), $d_{jk}$ is 
the distance (or the measure of dissimilarity) between sampling points $j$ and $k$, and 
$\mu_{ij}$ is the membership of point $j$ in cluster $i$. As pointed out by Ruspini, $J_d$ is a 
measure of cluster quality based on local density, because $J_d$ will be small when 
the terms in Equation (16.2) are individually small; in turn, this will occur when 
close pairs of points have nearly equal fuzzy memberships in the $C$ clusters. 

$J_t$ is a variation of the least-square functional proposed by Bezdek [1]: 

$$J_t = \sum_{k=1}^{P} \sum_{i=1}^{C} \mu_{ik}^2 \, d_{ik}, \qquad (16.3)$$


where $d_{ik}$ is the distance from point $k$ to the center (or prototype) $\mathbf{h}_i$ of 
cluster $i$. We call $J_t$ a typicality measure because it will be small when points in a 
cluster adhere tightly (have small $d_{ik}$'s) to their cluster center $\mathbf{h}_i$. 
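
To illustrate how a typicality measure of this kind behaves, the sketch below sums squared memberships weighted by distances to the cluster centers. It is a sketch under assumptions: the function name `typicality` is hypothetical, the Euclidean distance is assumed for the distance between a point and a center, and the memberships are passed in as a nested list indexed by cluster and then by point.

```python
import math

def typicality(points, centers, memberships):
    """Sum over clusters i and points k of mu_ik**2 * d_ik.

    memberships[i][k] is the membership of point k in cluster i;
    d_ik is taken to be the Euclidean distance (an assumption).
    """
    total = 0.0
    for i, h in enumerate(centers):
        for k, x in enumerate(points):
            total += memberships[i][k] ** 2 * math.dist(x, h)
    return total
```

Points that belong strongly to a cluster but lie far from its center contribute most to the sum, so a partition whose clusters hug their centers gives a smaller value.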

Density and typicality are important measures because they are closely related 
to two important characteristics of linguistic terms: the support and the core, 
respectively. The support is the range of nonzero membership values ($\mu > 0$), whereas 
the core is the range of full membership ($\mu = 1$). In general, we want a linguistic 
term to have a strong support (high density, or small $J_d$) and a representative core 
(good prototype, or small $J_t$). Thus, it is reasonable to choose $J_d + J_t$ to be our 
objective function. In other words, for each possible Guillotine cut, we calculate 
$J_d + J_t$ of the resulting partition. Then we select the partition with the least 
$J_d + J_t$ value as our next hypothesis to continue the hill-climbing process. 

Figure 16.3. A clustering without salient discriminating features. 
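
One step of this hill-climbing search can be sketched as follows. Everything specific here is an assumption made for illustration: the function name `best_guillotine_cut`, the representation of a region as a pair of corner tuples, the restriction of candidate cuts to the midpoint of each side, and the `objective` callable, which stands in for the combined density and typicality measure and must be supplied by the caller.

```python
def best_guillotine_cut(regions, objective):
    """Try one cut per region and dimension; keep the best resulting partition.

    regions is a list of (lower, upper) pairs of coordinate tuples.
    objective maps a candidate partition to a score; lower is better.
    """
    best_score, best_partition = None, None
    for r, (lower, upper) in enumerate(regions):
        for d in range(len(lower)):
            cut = 0.5 * (lower[d] + upper[d])  # assumed candidate: midpoint cut
            left_upper = upper[:d] + (cut,) + upper[d + 1:]
            right_lower = lower[:d] + (cut,) + lower[d + 1:]
            candidate = (regions[:r]
                         + [(lower, left_upper), (right_lower, upper)]
                         + regions[r + 1:])
            score = objective(candidate)
            if best_score is None or score < best_score:
                best_score, best_partition = score, candidate
    return best_partition, best_score
```

Calling the function repeatedly, each time on the partition returned by the previous call, gives an n-step greedy search that ends with n + 1 regions.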

Figure 16.2 summarizes our adaptive network-based fuzzy rulebase modeling 
scheme. As mentioned previously, the cutting procedure can be viewed as a method 
of building a decision tree. Efficient decision tree construction depends on the 
existence of salient discriminating features. If this is not the case (for example, 
there are no adequate Guillotine cuts in Figure 16.3), the mechanism of partitioning 
the feature space is no longer suitable for structure identification. Thus, usually in 
a complicated system, we are forced to use a large number of small rules. By a 
small rule we mean a rule whose antecedent part covers a relatively small region in 
the feature space. In the next section we propose a method of rule organization to 
cope with the resulting computational complexity. 

16.4 RULEBASE ORGANIZATION 

In a fuzzy inference system, the rules can also be viewed as a set of fuzzy points 
which as a whole approximate a compatibility relation. As more rules are involved, 
finer approximation as well as better modeling accuracy are likely to be achieved 
(see Figure 16.4). However, when modeling accuracy is the major concern and a 
massive number of rules is used to maintain the model's accuracy, then we must deal 
with the problem of computational complexity. A basic assumption here is that 
massively parallel implementation of an intelligent system with learning ability will 
still be costly, if not impossible, in the near future; thus, we have to realize the 
proposed ANFIS model on traditional computer systems. 

In this section, we introduce a data structure called a fuzzy boxtree to orga- 
nize rules so that pattern matching can be performed in logarithmic time. The 
mechanism includes the following steps: 

1. Use a divide-and-conquer data structure, the multi-level fuzzy grid, to 
partition the feature space and fine-tune a large number of small rules so that 
accurate local mappings are achieved. 

2. Define a fuzzy boxtree on antecedents of rules and provide a linear-time 
algorithm to construct it. 

3. Introduce a branch-and-bound algorithm for pattern matching in logarithmic 
time. 

4. Provide a parallel algorithm to maintain the advantage of parallel processing 
presumed in fuzzy inference systems. 

Figure 16.4. Fuzzy points approximating a compatibility relation. Various numbers 
of fuzzy points result in different degrees of information granularity: (a) coarse, (b) 
finer. The interpretation of fuzzy inference as a compatibility relation was discussed 
in detail by Ruspini [9]. 

Since we are going to use a large number of rules to achieve modeling accuracy 
and take different degrees of local complexity as well as unbalanced sample distri- 
bution into consideration, we adopt a multi-level fuzzy grid [see Figure 16.1(d)] as 
the structure to partition the feature space. The top level grid coarsely partitions 
the whole space into equal-sized and evenly spaced fuzzy boxes, which can be fur- 
ther partitioned by finer fuzzy grids. This straightforward partitioning continues 
until a terminating condition is met. Two criteria can be used as the terminating 
condition. The first is the balanced sampling criterion (i.e., the resulting boxes 
should contain similar numbers of training examples). The alternative is to use an 
application-dependent evaluation. For example, if we assume that each output is 
a linear combination of the inputs, as in the first-order ANFIS model, we can use 
LMS methods to evaluate the fitness of each grid. When the mean square error is 
below a threshold, we stop the partitioning process. 

Now we can apply the learning model based on adaptive networks to identify the 
parameters in each region. Because of the small size of the regions and the 
resulting local linearity, the learning efficiency is expected to be good. Moreover, since 
the entire feature space is still covered by the overlapping regions, the smoothness 
among regions will not be affected although the regions are now separately trained. 
The only problem to be solved is the computational complexity in operation time 
due to the resulting large number of rules. To solve the problem, we construct a 
boxtree to put the rules together. 

A binary fuzzy boxtree, $T$, is a rooted tree in which each internal node has 
two children. Let $R$ denote the set of nodes of $T$. Each node $r \in R$ is a fuzzy set 
with a membership function $\mu_r(u)$ such that 

If $s$ is a child of $r$, then $\mu_s(u) \le \mu_r(u)$, $\forall u \in U$; in other words, $s \subseteq r$. 

In our application, each leaf stands for a fuzzy pattern which represents the 
antecedent part of a certain fuzzy rule. Moreover, each membership function is bell 
shaped and is determined by six parameters: 

$$\mu(u) = \begin{cases} \dfrac{1}{1 + \left| (u - c_1)/a_1 \right|^{2 b_1}} & \text{if } u < c_1, \\ 1 & \text{if } c_1 \le u \le c_2, \\ \dfrac{1}{1 + \left| (u - c_2)/a_2 \right|^{2 b_2}} & \text{if } u > c_2, \end{cases} \qquad (16.4)$$

where $c_1 \le c_2$. Note that the 3-parameter generalized bell membership function 
defined in (2.23) (page 26) is a special case of this function with $a_1 = a_2$, 
$b_1 = b_2$, $c_1 = c_2$. 

The similarity measure between two fuzzy patterns, $A$ and $B$, is given by the 
following formula: 

$$S(A, B) = \bigwedge_i S(A_i, B_i), \qquad (16.5)$$

that is, a conjunctive aggregation of partial similarity measures, $S(A_i, B_i)$'s, in 
individual feature dimensions. $S(A_i, B_i)$ is in turn calculated by 

$$S(A_i, B_i) = \sup_{u \in U_i} \min\{\mu_{A_i}(u), \mu_{B_i}(u)\}. \qquad (16.6)$$

(See Figure 16.5.) The boxtree construction algorithm repeatedly finds the two 
boxes (patterns) with the largest similarity degree, makes them siblings, and inserts 
the parent as an internode. 
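
The similarity measure can be approximated numerically, as in the sketch below. Several points are assumptions rather than part of the text: the six-parameter membership function is taken to be a generalized bell on each side of a flat core between its two centers, the supremum is approximated on a uniform grid, the minimum is used as the conjunctive aggregation across dimensions, and the names `bell6`, `partial_similarity`, and `similarity` are hypothetical.

```python
def bell6(u, a1, b1, c1, a2, b2, c2):
    """Two-sided bell MF with a flat core on [c1, c2] (assumed form)."""
    if u < c1:
        return 1.0 / (1.0 + abs((u - c1) / a1) ** (2.0 * b1))
    if u > c2:
        return 1.0 / (1.0 + abs((u - c2) / a2) ** (2.0 * b2))
    return 1.0

def partial_similarity(pa, pb, lo, hi, steps=1000):
    """sup-min of two MFs in one dimension, approximated on a uniform grid."""
    best = 0.0
    for n in range(steps + 1):
        u = lo + (hi - lo) * n / steps
        best = max(best, min(bell6(u, *pa), bell6(u, *pb)))
    return best

def similarity(A, B, bounds, steps=1000):
    """Aggregate the partial similarities over dimensions (min is assumed)."""
    return min(partial_similarity(pa, pb, lo, hi, steps)
               for pa, pb, (lo, hi) in zip(A, B, bounds))
```

Two patterns whose cores overlap in every dimension have similarity 1, while patterns that only share their flanks get the height at which the flanks cross.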

A membership function in an internode $C$ is defined as a combination of the 
corresponding functions in its child nodes, $A$ and $B$. For example, if $A$ is specified by 
parameter set $\{a_{1A}, b_{1A}, c_{1A}, a_{2A}, b_{2A}, c_{2A}\}$ and $B$ by 
$\{a_{1B}, b_{1B}, c_{1B}, a_{2B}, b_{2B}, c_{2B}\}$, and if $(c_{1A}, c_{2A}) \le (c_{1B}, c_{2B})$ in 
Pareto ordering, we use the following parameters to characterize $C$: 

$$a_{1C} = a_{1A},\; b_{1C} = b_{1A},\; c_{1C} = c_{1A},\; a_{2C} = a_{2B},\; b_{2C} = b_{2B},\; c_{2C} = c_{2B}. \qquad (16.7)$$

Figure 16.6 shows the idea and the resulting inclusion relation among fuzzy sets. 
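
The parameter combination of Equation (16.7) amounts to taking the left flank of one child and the right flank of the other. A minimal sketch for a single feature dimension follows; the function name `merge_dimension` and the tuple layout `(a1, b1, c1, a2, b2, c2)` are hypothetical, and the case in which the two cores are not comparable in Pareto ordering is simply rejected because the text does not cover it.

```python
def merge_dimension(pa, pb):
    """Parent MF parameters in one dimension from two child parameter tuples.

    Each tuple is (a1, b1, c1, a2, b2, c2).
    """
    a_first = pa[2] <= pb[2] and pa[5] <= pb[5]
    b_first = pb[2] <= pa[2] and pb[5] <= pa[5]
    if not a_first:
        if not b_first:
            raise ValueError("cores are not comparable in Pareto ordering")
        pa, pb = pb, pa  # make pa the child with the smaller core
    # Left flank and left core edge from pa, right flank and edge from pb.
    return (pa[0], pa[1], pa[2], pb[3], pb[4], pb[5])
```

The core of the parent then spans from the left core edge of one child to the right core edge of the other, so both child cores lie inside it.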
Figure 16.5. Similarity measure between two fuzzy sets. 


Figure 16.6. Constructing an internode in a boxtree: (a) covering of membership 
functions, (b) corresponding inclusion of boxes. 


Since an internode inherits the (fuzzy) boundaries defined by its children, the 
preceding construction can be realized in linear time by employing the famous 
greedy algorithm for finding a maximum spanning tree, given the similarity measure 
between each pair of fuzzy patterns. Figure 16.7 shows a boxtree constructed. 

For $r \in R$ that is a leaf, $\mu_r(u)$ is the compatibility measure of an input $u$ against 
the pattern $r$; for an internode $r$, $\mu_r(u)$ is the upper bound of compatibility of 
the subtree it defines. This property provides a data structure to apply the basic 
branch-and-bound algorithm in searching for optimal solutions. For example, if we 
want to find all rules which have a firing strength larger than a specified value, the 
boxtree structure allows a search from the root to prune any subtrees whose root 
membership value is smaller than that value. 

Thus, we can use the following algorithm to find the best rule against which an 
input $u$ is matched. In this algorithm, $F$ stands for a frontier of expanded nodes, 
and $B$ is the bound reached up to a certain point (the largest leaf compatibility 
found so far). 

Algorithm 16.1 Branch-and-bound algorithm to find the best-matched rule 

1. $F \leftarrow \{\text{Root}\}$; $B \leftarrow -\infty$; 

2. while $F \neq \emptyset$ do 

    select a set of nodes $S \subseteq F$; 

    expand the internodes in $S$ to get the set of their children, $L(S)$; 

    $F \leftarrow \{F - S\} \cup L(S)$; $B \leftarrow \max(\{B\} \cup \{\mu_v(u) : v \in S \text{ and } v \text{ is a leaf}\})$; 

    $F \leftarrow \{v \in F : \mu_v(u) > B\}$. 

□ 

Figure 16.7. A boxtree. Each leaf corresponds to a fuzzy rule. The dotted lines 
define a frontier in a boxtree. Each frontier can be considered as a compressed 
rulebase. 
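
Algorithm 16.1 can be sketched in Python as follows. The `Node` class, the function name `best_match`, and the policy of selecting a single most promising node as the set S at each step are assumptions made for the sketch; the algorithm itself leaves the choice of S open. Each internode is assumed to carry a membership function that bounds those of its descendants from above.

```python
class Node:
    """Boxtree node: mu maps an input to a membership value."""
    def __init__(self, mu, children=(), rule=None):
        self.mu = mu
        self.children = list(children)  # empty for a leaf
        self.rule = rule                # label carried by a leaf

def best_match(root, u):
    """Return (firing strength, rule) of the best-matched leaf for input u."""
    frontier = [root]
    bound, best_rule = float("-inf"), None
    while frontier:
        # Select S: here one node, the most promising in the frontier (assumed).
        node = max(frontier, key=lambda v: v.mu(u))
        frontier.remove(node)
        if node.children:
            frontier.extend(node.children)  # expand an internode
        else:
            value = node.mu(u)
            if value > bound:               # a leaf updates the bound B
                bound, best_rule = value, node.rule
        # Prune every node that cannot beat the current bound.
        frontier = [v for v in frontier if v.mu(u) > bound]
    return bound, best_rule
```

Because an internode is discarded as soon as its value does not exceed the bound, whole subtrees are skipped without evaluating their leaves.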

Usually, in a fuzzy inference system, it is not necessary to find all rules with 
a firing strength greater than zero. Instead, we are satisfied with the best $k$ rules 
whose antecedents are compatible with the input. Algorithm 16.1 can be generalized 
to this case by keeping a priority queue of size $k$ and using the $k$th best value in 
pruning. This algorithm is of $O(\log_2 R)$ efficiency in pattern matching for a fuzzy 
rulebase with $R$ rules. 
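
The generalization to the best k rules can be sketched with a bounded min-heap that serves as the priority queue. The function name `best_k_matches` and the two callables `mu` and `children`, which hide the tree representation, are hypothetical; a depth-first expansion order is assumed for simplicity.

```python
import heapq

def best_k_matches(root, u, k, mu, children):
    """Return up to k (value, leaf) pairs with the largest mu(leaf, u).

    mu(node, u) gives the membership value of a node (an upper bound for an
    internode); children(node) gives its child list (empty for a leaf).
    """
    if k < 1:
        return []
    best = []       # min-heap of (value, counter, leaf), at most k entries
    counter = 0     # tie-breaker so that nodes themselves are never compared
    frontier = [root]
    while frontier:
        node = frontier.pop()
        value = mu(node, u)
        # Prune: this subtree cannot beat the kth best value found so far.
        if len(best) == k and value <= best[0][0]:
            continue
        kids = children(node)
        if kids:
            frontier.extend(kids)
        else:
            counter += 1
            item = (value, counter, node)
            if len(best) < k:
                heapq.heappush(best, item)
            else:
                heapq.heapreplace(best, item)
    return [(v, n) for v, _, n in sorted(best, reverse=True)]
```

The smallest entry of the heap is the kth best value, so it plays the role of the pruning bound once k leaves have been collected.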

The advantage of parallel processing is presumed for fuzzy rulebased inference 
systems because the rules are considered to be independent of each other in the 
pattern matching process. If we organize the rules into a structure, the boxtree, 
can we still claim the benefit? In other words, if we have $p$ processors instead of 
one, can we decrease the processing time in proportion to $p$, that is, achieve a 
linear speedup? The answer is positive. 

Let each of the $p$ processors maintain a local frontier, $F_i$, and a local priority 
queue $B_i$. At each step every processor $i$ does one of two things: 

1. If $F_i \neq \emptyset$ then it expands the node of best matching in $F_i$ and sends its 
children to processors chosen at random. 

2. If $F_i = \emptyset$ then it sends the message “there is a rule of firing strength $s$” to 
processors chosen at random. 

The processors then update the sets $F_i$ and queues $B_i$ on the basis of the messages 
received. The computation continues until all sets $F_i$ are empty. At this point, 
the best matches are given by the merge of the $B_i$'s. This algorithm provides a 
linear speedup. The details of a general parallel algorithm for a branch-and-bound 
procedure are described in ref. [11]. The analysis of its complexity is discussed in 
ref. [6]. 

In summary, given a fuzzy rulebase with R rules that models an application
system, we can build a boxtree of $2R - 1$ nodes and use a parallel branch-and-bound
algorithm to perform the pattern-matching task with logarithmic efficiency.
Consequently, with the boxtree data structure, we can use many more rules in the 
modeling process to achieve high performance without losing efficiency in the later 
pattern-matching process. 

16.5 FOCUS SET-BASED RULE COMBINATION 

To improve the performance of self-organized systems, such as those based on
adaptive networks, dynamic skeletonization is usually necessary. By skeletonization
we mean trimming the redundant or the less important part of a complicated
system, as suggested in ref. [4]. However, we claim that skeletonization should be
done under a dynamic relevance criterion (i.e., which part to trim or to simplify
should be determined by the current situation). In this section we discuss a method
of skeletonization, or rulebase compression, for adaptive network-based fuzzy
inference systems.

Note that in Figure 16.7, every frontier in a boxtree can be viewed as a fuzzy
rulebase because it covers the entire feature space. For a frontier containing
internodes, we can use either of the following two methods to determine the consequent
as well as the antecedent parameters. The first way, the local approach, is to adopt
the antecedent parameters as specified in the boxtree and to use LMS methods of
finding a hyperplane to approximate the training data covered by individual
regions. The alternative, the global approach, is to use the antecedent parameters as
the initial values for the ANFIS model and rerun the entire training process. The
latter way is much more time consuming and should be used only when the goal is
to find a merged rulebase permanently. If we consider the feature space partitioning
method introduced in Section 16.3 as a top-down approach, rule merging can be
viewed as a bottom-up way of identifying a system structure.

As asserted before, a fuzzy rulebased system can be used to solve various
interpolation and classification problems. However, to be of practical use for system
representation and communication purposes (e.g., in medical applications of image
archiving systems), dynamic rulebase compression becomes essential. With rule
compression, those components of the system that are supposed to be of less
relevance for the current use of the system are (temporarily) simplified. The more
relevant a component, the more rules are used for that component. The degree of
relevance is specified by a focus set.

A focus set, or a focus window, is a fuzzy set defined on the feature space
which indicates the focus of our current interest. Given a focus window W, the
similarity gain, G(r), defined on an internode r is calculated as

$$G(r) = S(r_1, W) + S(r_2, W) - S(r, W), \tag{16.8}$$

where $r_1$, $r_2$ are the two children of r, and S is the similarity measure defined before.

Now we can use the following algorithm to find the most suitable frontier
containing n rules to approximate the original rulebase with respect to W.

Algorithm 16.2 Algorithm to find the best-matched rulebase with respect to a focus
set

1. $F \leftarrow \{\text{Root}\}$; calculate similarity gain G(Root) for Root;

2. while $|F| < n$ do

    select an internode $r \in F$ with the largest similarity gain G(r);
    expand r to get the set of its children, L(r);
    calculate similarity gain for nodes in L(r);
    $F \leftarrow \{F - r\} \cup L(r)$.


□ 
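
The greedy expansion in Algorithm 16.2 can be sketched in Python as shown below. This is a minimal sketch under assumptions: the `Node` class and the function names are hypothetical, and the similarity measure S is passed in as a caller-supplied function `S(node, window)` because its definition lies outside this section. A heap keeps the expandable internodes ordered by similarity gain.

```python
import heapq


class Node:
    """Hypothetical boxtree node; an internode has exactly two children."""

    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right

    def is_internode(self):
        return self.left is not None and self.right is not None


def similarity_gain(r, window, S):
    """G(r) = S(r1, W) + S(r2, W) - S(r, W), following Equation (16.8)."""
    return S(r.left, window) + S(r.right, window) - S(r, window)


def best_frontier(root, window, S, n):
    """Grow a frontier from the root until it holds n nodes or cannot grow."""
    frontier = [root]
    heap = []        # max-heap of expandable internodes, keyed by -G(r)
    counter = 0      # tie-breaker so that nodes themselves are never compared
    if root.is_internode():
        heap.append((-similarity_gain(root, window, S), counter, root))
    while len(frontier) < n and heap:
        _, _, r = heapq.heappop(heap)
        frontier.remove(r)                    # F <- {F - r} U L(r)
        for child in (r.left, r.right):
            frontier.append(child)
            if child.is_internode():
                counter += 1
                gain = similarity_gain(child, window, S)
                heapq.heappush(heap, (-gain, counter, child))
    return frontier
```

Each expansion replaces one node by two, so a frontier of n nodes is reached after n - 1 expansions. Removing a node from the Python list is itself a linear scan, so a set would be the better container for large frontiers.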


This algorithm is of linear efficiency.

A compressed rulebase with respect to a focus window is shown by dotted lines
in Figure 16.7. Once the structure is determined, we apply the local approach
mentioned previously to identify the consequent parameters. Thus, a simplified
but still proper rulebase is constructed. It can be used for applications like image
coding and hierarchical pattern matching. When higher resolution is required for a
simplified region, the corresponding internode can be expanded to provide a finer
sub-rulebase.

16.6 SUMMARY 

Structure identification in general is realized with two different approaches. The
top-down method partitions the input (feature) space by guillotine cuts under the
guidance of fuzzy clustering objective functions. The two measures that we choose
to evaluate clusters, density and typicality, have a sound theoretical background in
fuzzy sets.

The bottom-up approach emphasizes modeling accuracy and uses many small
rules. The rules are structured into a fuzzy binary boxtree to speed up the pattern
matching process when the rulebase is in operation. A parallel algorithm described
in this chapter can be used to maintain the advantage of fuzzy systems in parallel
processing.

The rules can also be merged or combined according to a dynamically defined 
focus set. Algorithms for identifying a suitable frontier in a boxtree and determining 
the parameters are described in this chapter. The sub-rulebases thus found can 
be organized into a hierarchy so that rulebases with various granularities can be 
employed in accordance with different demands of accuracy and efficiency. 


EXERCISE 

1. Modify the two-input two-rule ANFIS architecture in Figure 12.1 to incorporate
two importance measures $\sigma_1$ and $\sigma_2$ for both inputs. Explain clearly the node
functions for your new ANFIS architecture.


REFERENCES 

[1] Jim C. Bezdek. Pattern recognition with fuzzy objective function algorithms. Plenum
Press, New York, 1981.

[2] Didier Dubois, Henri Prade, and Claudette Testemale. Weighted fuzzy pattern
matching. Fuzzy Sets and Systems, 28:313-331, 1988.

[3] Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for
finding best matches in logarithmic expected time. ACM Transactions on
Mathematical Software, 3(3):209-226, 1977.

[4] Michael C. Mozer and Paul Smolensky. Skeletonization: a technique for trimming
the fat from a network via relevance assessment. Technical Report CU-CS-421-89,
Department of Computer Science and Institute of Cognitive Science, University of
Colorado, 1989.

[5] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.

[6] Abhiram Ranade. A simpler analysis of the Karp-Zhang parallel branch-and-bound
method. Technical Report UCB/CSD 90/586, Computer Science Division, University
of California, Berkeley, 1990.

[7] Enrique H. Ruspini. Numerical methods for fuzzy clustering. Information Sciences,
2:319-350, 1970.

[8] Enrique H. Ruspini. Recent development in fuzzy clustering. In Fuzzy set and
possibility theory, pages 133-147. North Holland, 1982.

[9] Enrique H. Ruspini. On the semantics of fuzzy logic. International Journal of
Approximate Reasoning, 5:45-88, 1991.

[10] M. Sugeno and G. T. Kang. Structure identification of fuzzy model. Fuzzy Sets and
Systems, 28:15-33, 1988.





[11] Yanjun Zhang. Parallel algorithms for combinatorial search problems. Technical
Report UCB/CSD 89/543, Computer Science Division, University of California,
Berkeley, 1989.




Part VI 


Neuro-Fuzzy Control 




Chapter 17 


Neuro-Fuzzy Control I


J.-S. R. Jang 


17.1 INTRODUCTION 

Application of fuzzy inference systems to automatic control was first reported in 
Mamdani’s paper [17] in 1975, where, based on Zadeh’s proposition [32], a fuzzy 
logic controller (FLC) was used to emulate a human operator’s control of a steam 
engine and boiler combination. Since then, fuzzy logic control [12, 14, 15, 25] 
has gradually been recognized as the most significant and fruitful application for 
fuzzy logic and fuzzy set theory. In the past few years, advances in microproces- 
sors and hardware technologies have created an even more diversified application 
domain for fuzzy logic controllers, which ranges from consumer electronics to the 
automobile industry. Indeed, for complex and/or ill-defined systems that are not 
easily subjected to conventional automatic control methods, FLCs provide a feasible 
alternative since they can capture the approximate, qualitative aspects of human 
reasoning and decision-making processes. However, without adaptive capability, 
the performance of FLCs relies exclusively on two factors: the availability of hu- 
man experts, and the knowledge acquisition techniques to convert human expertise 
into appropriate fuzzy if-then rules and membership functions. These two factors 
substantially restrict the application domain of FLCs. 

On the other hand, investigation into using neural networks in automatic control 
systems did not receive much attention until the backpropagation learning rule was 
reformulated by Rumelhart et al. [24] in 1986. Since then, research of neural control 
has evolved quickly and a number of neural controller design methods have been 
proposed in the literature [6, 23, 28]. 

As explained in Chapters 8, 9, and 12, supervised learning neural networks and 
fuzzy inference systems are special instances of adaptive networks, which in certain 
ways are the most general form of modeling and computing-structure construction. 
Consequently, a neural control design approach can usually be carried over directly 
to the design of fuzzy controllers, unless the design method depends directly on the 
specific architecture of the neural network used (which is rare). This portability
endows us with a number of design methods for fuzzy controllers which can easily
take advantage of a priori human information and expertise in the form of fuzzy
if-then rules. The resulting methodologies, often referred to as neuro-fuzzy control,
are the topics of this and the next chapter.

Generally speaking, these methodologies can be classified into two categories.
The first category consists of design methods obtained directly from the neural
control literature, such as expert (mimicking) control, inverse learning,
specialized learning, backpropagation through time, and real-time recurrent learning;
these approaches are discussed in this chapter. The second category contains
design methods that are not directly or necessarily related to neural-like learning;
these methods are explained in the next chapter. Some of the design methods in
the second category take advantage of conventional control techniques such as gain 
scheduling, feedback linearization, adaptive control, and sliding mode control; oth- 
ers apply derivative-free optimization techniques (Chapter 7) or reinforcement types 
of learning (Chapter 10). 

As usual, there is no single best way to design a fuzzy controller under all
circumstances. We shall summarize the pros and cons of each of these methods 
and provide simple guidelines for choosing an appropriate method for a specific 
application in this and the next chapter. 

17.2 FEEDBACK CONTROL SYSTEMS AND NEURO-FUZZY 
CONTROL: AN OVERVIEW 

17.2.1 Feedback Control Systems 

Figure 17.1 is a block diagram of a typical feedback control system, where the 
plant (or process) represents the dynamical system to be controlled and the con- 
troller employs a control strategy to achieve a control goal. Here we shall denote 
the state variables of the plant as a vector x(t); these variables are usually governed
by a set of state equations (usually differential equations) that characterize the
dynamic behavior of the plant. Since the state variables are internal to the plant,
some of them may not be directly measurable from the external world. The
measurable quantities of the plant, also known as its outputs, are denoted as a vector y(t)
and are usually a static function of the state variables. Unless otherwise specified,
we shall assume that all states are measurable; thus the output of the plant y(t)
is equal to the state x(t). The preceding state equation will be used extensively in 
subsequent discussions. 

The state equation for a general nonlinear time-invariant plant can be expressed 
in the matrix notation 

$$\dot{x}(t) = f(x(t), u(t)) \quad \text{(plant dynamics)}, \tag{17.1}$$

where u(t) is the controller’s output at time t, and the size of the vector x(t) is called
the order of the plant. A general control goal is to find a controller with a static
function $\phi(\cdot)$ that maps an observed plant output x(t) to a control action u, that
is, $u(t) = \phi(x(t))$, such that the plant output x(t) can follow some given desired
output signal $x_d(t)$ as closely as possible. If $x_d(t)$ is a constant vector (usually
the origin in state space), then the control problem is referred to as a regulator
problem, where the plant states are directly fed back to the controller. This is
actually what Figure 17.1 shows. On the other hand, if the desired trajectory $x_d(t)$
is a time-varying signal, then we have a tracking problem in which an error
signal, defined as the difference between desired and actual outputs, is fed back to
the controller. If f is unknown, we need to perform system identification first to find
an appropriate model for the plant. Moreover, if f is time varying, it is desirable to
make $\phi(\cdot)$ adaptive to respond to the changing characteristics of the plant.

Figure 17.1. Block diagram for a continuous-time feedback control system.

If we are dealing with a linear feedback control system, the plant and controller 
can be reformulated as the following equations: 

$$\begin{aligned} \dot{x}(t) &= A x(t) + B u(t) && \text{(plant dynamics)}, \\ u(t) &= K x(t) && \text{(linear controller)}. \end{aligned} \tag{17.2}$$

The treatment of linear feedback control systems is relatively complete in the liter- 
ature (for example, see refs. [2, 33]) and will not be discussed separately here. 

In contrast, control systems without feedback loops are called open-loop con- 
trol systems and lack certain advantages that feedback control systems provide. 
They are relatively more sensitive to unexpected external disturbances and inter- 
nal changes in system characteristics. Thus, they are often bypassed in favor of 
feedback control systems in real-world applications, despite the latter’s tendency 
toward instability due to overcorrecting in response to feedback signals. All the fol- 
lowing discussions are therefore based on feedback control systems unless otherwise 
indicated. 

A simple example of a nonlinear feedback control system is the inverted pendu- 
lum system. We shall use this system repeatedly throughout the rest of this chapter 
and the next one. 

Example 17.1 The inverted pendulum system

Figure 17.2. The inverted pendulum system.

Figure 17.2 shows an inverted pendulum system (also called a “cart-pole system”),
a classic example of a nonlinear feedback control system. A rigid pole is attached
to a cart with a hinge, a free joint with only one degree of freedom. The cart can
move to the right or left on rails when a force is exerted on it. This dynamic system
is characterized by four state variables: $\theta$ (angle of the pole with respect to the
vertical axis), $\dot{\theta}$ (angular velocity of the pole), $z$ (position of the cart on the track),
and $\dot{z}$ (velocity of the cart); these state variables are governed by the following
second-order differential equations [3, 13]:


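$$\ddot{\theta} = \frac{g \sin\theta + \cos\theta \left( \dfrac{-u - m l \dot{\theta}^2 \sin\theta}{m_c + m} \right)}{l \left( \dfrac{4}{3} - \dfrac{m \cos^2\theta}{m_c + m} \right)}, \tag{17.3}$$

$$\ddot{z} = \frac{u + m l (\dot{\theta}^2 \sin\theta - \ddot{\theta} \cos\theta)}{m_c + m}, \tag{17.4}$$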


where g is the acceleration due to gravity (usually 9.8 m/s$^2$), $m_c$ is the mass
of the cart, m is the mass of the pole, l is the half-length of the pole, and u is the
applied force in newtons. By defining the state vector $[x_1\ x_2\ x_3\ x_4]^T$ as $[\theta\ \dot{\theta}\ z\ \dot{z}]^T$,
we can put the preceding equations into the standard format for state equations:


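$$\dot{x} = \begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \\ \dot{x}_4 \end{bmatrix} = f(x, u) = \begin{bmatrix} x_2 \\ \dfrac{g \sin x_1 + \cos x_1 \left( \dfrac{-u - m l x_2^2 \sin x_1}{m_c + m} \right)}{l \left( \dfrac{4}{3} - \dfrac{m \cos^2 x_1}{m_c + m} \right)} \\ x_4 \\ \dfrac{u + m l (x_2^2 \sin x_1 - \dot{x}_2 \cos x_1)}{m_c + m} \end{bmatrix}. \tag{17.5}$$

(Note that $x_1 = \theta$ and $\dot{x}_1 = \dot{\theta}$. Also, $x_2 = \dot{\theta}$ by definition, so we have $\dot{x}_1 = x_2$.
Similarly, $\dot{x}_3 = x_4$.)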
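
To make the state-equation form concrete, the following Python sketch evaluates f(x, u) for the inverted pendulum and advances the state by one forward Euler step. It is an illustration only: the function names, the default parameter values (m_c = 1.0 kg, m = 0.1 kg, l = 0.5 m), and the choice of Euler integration with a 0.01 s step are assumptions made here, not values given in the text.

```python
import math


def cart_pole_derivative(x, u, g=9.8, m_c=1.0, m=0.1, l=0.5):
    """Return dx/dt = f(x, u) for the state x = [theta, theta_dot, z, z_dot].

    The default masses and pole half-length are illustrative assumptions.
    """
    theta, theta_dot, _z, z_dot = x
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    total_mass = m_c + m
    # Angular acceleration of the pole (second component of f).
    theta_acc = (
        g * sin_t + cos_t * ((-u - m * l * theta_dot ** 2 * sin_t) / total_mass)
    ) / (l * (4.0 / 3.0 - m * cos_t ** 2 / total_mass))
    # Acceleration of the cart (fourth component of f); it uses theta_acc.
    z_acc = (u + m * l * (theta_dot ** 2 * sin_t - theta_acc * cos_t)) / total_mass
    return [theta_dot, theta_acc, z_dot, z_acc]


def euler_step(x, u, dt=0.01):
    """Advance the state by one forward Euler step of length dt (assumed)."""
    dx = cart_pole_derivative(x, u)
    return [xi + dt * dxi for xi, dxi in zip(x, dx)]
```

With zero force, the upright position at rest is an equilibrium of this model, while any small nonzero angle produces an angular acceleration in the same direction, which is why the plant needs feedback control.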

As control engineers, our mission is to find a controller $u = \phi(x)$ that maps a
state vector x (or an error signal $x_d - x$) into an appropriate force u, such that a
control goal can be achieved in a satisfactory manner. Usual control goals for the
inverted pendulum system include the following:

• Keep the pole balanced, regardless of the cart’s position.

• Keep the pole tracking a desired signal, regardless of the cart’s position.

• Keep the pole balanced and limit the cart’s position to a track of limited
length.

• Keep the pole balanced while the cart is tracking a desired signal.

The preceding control goals are listed according to their degrees of difficulty. We
shall see how some of these goals may be achieved in subsequent discussions.

□

Figure 17.3. Block diagram for a discrete-time feedback control system.

A general block diagram of a feedback control system in a discrete-time domain
is shown in Figure 17.3; x(k) and u(k) are the state vector and control action,
respectively, at time k. (When we say “time is k” in a discrete-time sense, what
we really mean is “time is kT,” where T is the sampling period of the underlying
computer-controlled system.) Note that the inputs to the plant block include the
control action u(k) and the previous plant output x(k) (assuming that the plant
state vector is equal to the plant output vector), so the plant block now represents
a static mapping. In symbols, we have

$$\begin{cases} x(k+1) = f(x(k), u(k)) & \text{(plant)}, \\ u(k) = g(x(k)) & \text{(controller)}. \end{cases} \tag{17.6}$$

Again, the control problem becomes that of finding the mapping $g(\cdot)$ for the
controller such that the resulting overall system exhibits certain desired behavior.
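
The closed loop formed by the plant and controller mappings of Equation (17.6) can be simulated with a few lines of Python. The sketch below is hypothetical: the function name `simulate_closed_loop` and the scalar plant and controller used in the usage note are illustrative choices, not taken from the text.

```python
def simulate_closed_loop(f, g, x0, steps):
    """Iterate x(k+1) = f(x(k), u(k)) with u(k) = g(x(k)) for a number of steps.

    f and g are caller-supplied static mappings; the returned list holds
    x(0), x(1), ..., x(steps).
    """
    trajectory = [x0]
    x = x0
    for _ in range(steps):
        u = g(x)          # controller: static mapping from state to action
        x = f(x, u)       # plant: static mapping from (state, action) to next state
        trajectory.append(x)
    return trajectory
```

For instance, with the assumed unstable scalar plant f(x, u) = 1.2x + u and the linear controller g(x) = -0.7x, the closed loop becomes x(k+1) = 0.5x(k), so the state halves at every step and is regulated toward the origin.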

We shall use Equations (17.1) and (17.6) extensively in the following discussion. 
So we reiterate the assumptions behind these formulas: 

• The order of the plant (that is, the number of state variables) is known. 

• All state variables in the plant are measurable, that is, all states are also
output variables.

17.2.2 Neuro-Fuzzy Control

If we replace the controller blocks in Figures 17.1 and 17.3 with neural networks or 
fuzzy inference systems, then we end up with neural or fuzzy control systems, 
respectively. In other words, neural or fuzzy control design methods are systematic 
ways of constructing neural networks or fuzzy inference systems, respectively, as 
controllers intended to achieve prescribed control goals. In the same vein, neuro- 
fuzzy control refers to the design methods for fuzzy logic controllers that employ 
neural network techniques. In particular, we shall concentrate on design methods 
for ANFIS (adaptive neuro-fuzzy inference systems; see Chapter 12); thus ANFIS 
and neuro-fuzzy controllers will be used interchangeably in this book. 

As demonstrated in previous chapters, fuzzy inference systems (FISs) and
multilayer perceptrons (MLPs) are special instances of a more general computing
framework called adaptive networks; therefore, both of these instances inherit the
backpropagation learning ability of the adaptive network. However, the fuzzy
inference system is superior to the multilayer perceptron in that the former can represent
structured knowledge while the latter is more or less like a black box. As a result, 
we can identify some unique properties of ANFIS controllers: 

1. Learning ability 

2. Parallel operation 

3. Structured knowledge representation 

4. Better integration with other control design methods 

Note that a multilayer perceptron also has properties 1 and 2, but not 3 and 4. 
In the rest of this chapter and the next one, we shall introduce several neuro-fuzzy 
design methods for constructing an ANFIS controller. Note that while some of the 
methods are unique to ANFIS, most of them are derived directly from methods for 
neural controller design, and these methods usually apply directly to more general 
cases of adaptive network controller design. 

Most neural or fuzzy controllers are nonlinear; thus rigorous analysis for
neuro-fuzzy control systems is difficult and remains a challenging area for further
investigation. On the other hand, a neuro-fuzzy controller usually contains a large number of
parameters; it is thus more versatile than a linear controller in dealing with nonlin- 
ear plant characteristics. Therefore, neuro-fuzzy controllers almost always surpass 
pure linear controllers if designed properly. 

17.3 EXPERT CONTROL: MIMICKING AN EXPERT 

The original purpose of fuzzy logic control, as proposed in Mamdani’s seminal
paper in 1975, was to mimic the behavior of a human operator able to control a
complex plant satisfactorily. The complex plant in question could be a chemical
reaction process, a subway train, or a traffic signal control system. After more than
20 years, the ultimate goal of fuzzy controllers stays the same: to automate
an entire control process by replacing a human operator with a fuzzy controller
made up of computer software/hardware, or a single silicon chip that costs little,
responds quickly, behaves consistently, and works around the clock.

To construct a fuzzy controller, we need to perform knowledge acquisition, 
which takes a human operator’s knowledge about how to control a system and gen- 
erates a set of fuzzy if-then rules as the backbone for a fuzzy controller that behaves 
like the original human operator. Usually we can obtain two types of information 
from a human operator: linguistic information and numerical information. 

Linguistic information An experienced human operator can usually summarize 
his or her reasoning process in arriving at final control actions or decisions 
as a set of fuzzy if-then rules with imprecise but roughly correct membership 
functions; this corresponds to the linguistic information supplied by human ex- 
perts, which is obtained via a lengthy interview process plus a certain amount 
of trial and error. 

Numerical information When a human operator is working, it is possible to 
record the sensor data observed by the human and the human’s corresponding 
actions as a set of desired input-output data pairs. This data set can be used 
as a training data set in constructing a fuzzy controller. 

Prior to the emergence of neuro-fuzzy approaches, most design methods used 
only the linguistic information to build fuzzy controllers; this approach is not easily 
formalized and is more of an art than an engineering practice. Following this ap- 
proach usually involves manual trial-and-error tweaking processes to fine-tune the 
membership functions. Successful fuzzy control applications based on linguistic in- 
formation plus trial-and-error tuning include steam engine and boiler control [17], 
Sendai subway systems [31], container ship crane control [30], elevator control [16], 
nuclear reaction control [1], automobile transmission control [9], aircraft control [5], 
and many others [25]. 

Now, with learning algorithms, we can take further advantage of the numerical 
information (input-output data pairs) and refine the membership functions in a 
systematic way. In other words, we can use linguistic information to identify the 
structure of a fuzzy controller, and then use numerical information to identify the 
parameters such that the fuzzy controller can reproduce the desired action more
accurately. Full exploitation of linguistic/numerical information is expected to expand
greatly the application possibilities of fuzzy controllers.

Note that mimicking a human expert is not only good for control applications. 
If the target system to be emulated is a human physician or a credit analyst, then 
the resulting fuzzy inference systems become fuzzy expert systems for diagnosis 
and credit analysis, respectively. 

The capacity to use linguistic information is specific to fuzzy inference systems;
it is hard for multilayer perceptrons to take advantage of this information and
directly encode it into the network’s structure. Using numerical data to train neural
networks to emulate a human expert has also achieved some success; examples
include the unmanned vehicle developed at Carnegie-Mellon University [21, 22].

17.4 INVERSE LEARNING 
17.4.1 Fundamentals 

The development of inverse learning [29], also known as general learning [23], 
for designing neuro-fuzzy controllers involves two phases. In the learning phase, an 
on-line or off-line technique is used to model the inverse dynamics of the plant. The 
obtained neuro-fuzzy model, which represents the inverse dynamics of the plant, is 
then used to generate control actions in the application phase. These two phases can 
proceed simultaneously, hence this design method fits in perfectly with the classical 
adaptive control scheme. 

By assuming that the order of the plant (that is, the number of state variables) 
is known and all state variables are measurable, we have 

$$x(k+1) = f(x(k), u(k)), \tag{17.7}$$

where x(k + 1) is the state at time k + 1, x(k) is the state at time k, and u(k) is
the control signal at time k. [For pedagogic purposes, we assume here that u(k) is
a scalar.] Similarly, the state at time k + 2 is expressed as

$$x(k+2) = f(x(k+1), u(k+1)) = f(f(x(k), u(k)), u(k+1)). \tag{17.8}$$

In general, we have

$$x(k+n) = F(x(k), U), \tag{17.9}$$

where n is the order of the plant, F is a multiple composite function of f, and U is
the control actions from k to $k + n - 1$, which is equal to
$[u(k), u(k+1), \ldots, u(k+n-1)]^T$. The preceding equation points out the fact that given
the control input u from time k to $k + n - 1$, the state of the plant will move from x(k)
to x(k + n) in exactly n time steps. Furthermore, we assume that the inverse dynamics
of the plant do exist, that is, U can be expressed as an explicit function of x(k) and
x(k + n):

$$U = G(x(k), x(k+n)). \tag{17.10}$$

This equation essentially says that there exists a unique input sequence U, specified
by mapping G, that can drive the plant from state x(k) to x(k + n) in n time steps.
The problem now becomes how to find the inverse mapping G.

Let us look at a special case in which the state equation in Equation (17.7) is 
linear. 

Example 17.2 Inverse dynamics of a linear system

In terms of linear systems, Equation (17.7) can be written as

$$x(k+1) = A x(k) + B u(k), \tag{17.11}$$

where A and B are $n \times n$ and $n \times 1$ matrices, respectively. By repeating the
preceding equation, we obtain the state at k + n:

$$x(k+n) = A^n x(k) + W U, \tag{17.12}$$

where $W = [A^{n-1}B \ \cdots \ AB \ \ B]$ is the controllability matrix. If W is nonsingular,
then the system is controllable and U can be calculated as

$$U = W^{-1}[x(k+n) - A^n x(k)]. \tag{17.13}$$

In other words, the controllability in a linear system is equivalent to the inverse
condition mentioned earlier.


□ 
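
The computation in Example 17.2 can be checked numerically. The Python sketch below builds the controllability matrix W column by column and solves W U = x(k + n) - A^n x(k) by Gaussian elimination. It is a minimal illustration: the helper names are hypothetical, matrices are plain nested lists, B is passed as a flat list of length n, and the singularity tolerance of 1e-12 is an arbitrary assumption.

```python
def mat_vec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("W is singular: the system is not controllable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def inverse_control_sequence(A, B, x_now, x_target):
    """Return U = [u(k), ..., u(k+n-1)] that drives x_now to x_target in n steps."""
    n = len(A)
    cols = [list(B)]                       # B, AB, ..., A^(n-1) B
    for _ in range(n - 1):
        cols.append(mat_vec(A, cols[-1]))
    cols.reverse()                         # A^(n-1) B, ..., AB, B
    W = [[cols[j][i] for j in range(n)] for i in range(n)]
    free = list(x_now)                     # becomes A^n x(k)
    for _ in range(n):
        free = mat_vec(A, free)
    rhs = [t - f for t, f in zip(x_target, free)]
    return solve(W, rhs)
```

As a usage example, for the assumed second-order plant with A = [[1, 1], [0, 1]] and B = [0, 1], calling `inverse_control_sequence(A, B, [0.0, 0.0], [1.0, 0.0])` yields a two-step control sequence; applying it through Equation (17.11) moves the state from the origin to the target in exactly two steps.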

Although the inverse mapping G in Equation (17.10) exists by assumption, it
does not always have an analytical closed form. Therefore, instead of looking for
methods of solving Equation (17.10) explicitly, we can use an adaptive network
or ANFIS with 2n inputs and n outputs to approximate the inverse mapping G
according to the generic training data pairs

$$[x(k)^T, x(k+n)^T; U^T]. \tag{17.14}$$

Figure 17.4 illustrates the situation in which n is equal to 1. Figure 17.4(a) shows
a plant block in which the plant output x(k + 1) is a function of a previous state
x(k) and input u(k); we use a $z^{-1}$ block to represent the unit-time delay operator.
Figure 17.4(b) is the block diagram during the training phase; Figure 17.4(c) is the
block diagram during the application phase.

Assume that the adaptive network truly imitates the input-output mapping of
the inverse dynamics G. Then, given the current state x(k) and the desired future
state $x_d(k+n)$, the adaptive network will generate an estimated $\hat{U}$:

$$\hat{U} = \hat{G}(x(k), x_d(k+n)). \tag{17.15}$$

After n steps, this control sequence can bring the state x(k) to the desired state
$x_d(k+n)$, assuming that the adaptive network function $\hat{G}$ is exactly the same as
the inverse mapping G. This application phase is shown in the block diagram of
Figure 17.4(c). If the future desired state $x_d(k+n)$ is not available in advance,
we can use the current desired state $x_d(k)$ instead in Figure 17.4(c). This implies
that the current desired state will appear after n time steps and the whole system
behaves like a pure n-step time delay system.

When $\hat{G}$ is not close to G, the control sequence $\hat{U}$ cannot bring the state to
$x_d(k+n)$ in exactly the next n time steps. As more data pairs are used to refine
the parameters in the adaptive network, $\hat{G}$ will become closer to G and the control
will be more and more accurate as the training process goes on.

For off-line applications, we have to collect a set of training data pairs and then
train the adaptive network in the batch mode. For on-line applications to deal
with time-varying systems, the control actions in Equation (17.15) are generated
every n time steps while on-line learning occurs at every time step. Alternatively,
we can generate the control sequence at every time step and apply only the first
component to the plant. Figure 17.5 is a block diagram for on-line learning when
n is equal to 1. The dashed line in the figure indicates that the two ANFIS blocks
are exact duplicates of each other. (For simplicity, we have removed the unit-time
delay operator $z^{-1}$.)

Figure 17.4. Block diagram for the inverse learning method: (a) plant block; (b)
training phase; (c) application phase.

The rationale behind inverse learning seems straightforward. However, it
assumes the existence of inverse dynamics for a plant, which is not generally valid.
Moreover, minimization of the network error $\|U - \hat{U}\|^2$ does not guarantee
minimization of the overall system error $\|x_d(k) - x(k)\|^2$.

Figure 17.5. Block diagram for on-line inverse learning.

Figure 17.6. Collecting training data for inverse control, where the upper plot is
u(k) and the lower one is y(k). (MATLAB file: inv_sig.m)


17.4.2 Case Studies 

Suppose that a plant is described by the following discrete dynamical equation:

$$y(k+1) = \frac{y(k)\,u(k)}{1 + y^2(k)} - \tan(u(k)), \tag{17.16}$$

where y(k) and u(k) are the state and control action, respectively, at time step k.
Here we assume the dynamics of the plant to be unknown, and we are going to
build an ANFIS that maps a given input pair [y(k), y(k + 1)] to a desired
control action u(k). This mapping is not easily expressed as an analytical formula,
even if the preceding equation governing the plant dynamics is already known.

To train the ANFIS, we need to collect training data pairs. This is done by 
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Figure 17.7. (a) Scatter plot of training data; (b) ANFIS surface after training. 
(MATLAB files: inv.sig.m and inv_fc.m) 


choosing inputs u(k), k = 1 to 101, as uniformly distributed random numbers 
between —1 and 1, and then employing Equation (17.16) [with y{ 1) = 0] to find 100 
training data pairs of the form [y(k),y(k + 1); u(k)], with k = 1 to 100. Figure 17.6 
shows the input and output sequences of the system to be controlled; Figure 17.7(a) 
is a scatter plot of the training data thus collected. After 30 training epochs, an 
ANFIS with nine rules exhibits a control surface shown in Figure 17.7(b). For the 
desired output specified by the equation 

$$y_d(k) = 0.6\sin(2\pi k/250) + 0.2\sin(2\pi k/50),$$

the ANFIS controller achieves good performance, as shown in Figure 17.8, where the 
left-hand plot indicates desired and actual outputs of the plant and the right-hand 
the difference between them. 
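
The data collection and inverse-model training of this case study can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reproduction of inv_sig.m or inv_fc.m: the plant is taken to have the form y(k+1) = y(k)u(k)/(1 + y^2(k)) - tan(u(k)), a cubic polynomial fitted by least squares stands in for the nine-rule ANFIS, and the names `plant`, `features`, and `controller` are illustrative. Clipping the control action to the training range [-1, 1] is an added safeguard, not part of the original simulation.

```python
import numpy as np

def plant(y, u):
    # Assumed form of the plant in Equation (17.16).
    return y * u / (1.0 + y**2) - np.tan(u)

def features(y, y_next):
    # Cubic polynomial basis in (y(k), y(k+1)); a simple stand-in for ANFIS.
    y = np.asarray(y, dtype=float)
    z = np.asarray(y_next, dtype=float)
    return np.stack([np.ones_like(y), y, z, y * y, y * z, z * z,
                     y**3, y * y * z, y * z * z, z**3], axis=-1)

# Collect 100 training pairs [y(k), y(k+1); u(k)] with random inputs and y(1) = 0.
rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, size=101)
y = np.zeros(101)
for k in range(100):
    y[k + 1] = plant(y[k], u[k])

# Fit the inverse model: (y(k), y(k+1)) -> u(k).
X = features(y[:100], y[1:101])
coef, *_ = np.linalg.lstsq(X, u[:100], rcond=None)
train_rms = float(np.sqrt(np.mean((X @ coef - u[:100]) ** 2)))

def controller(y_now, y_target):
    # Inverse control: ask the model which u moves y_now to y_target.
    u_hat = features(y_now, y_target) @ coef
    return float(np.clip(u_hat, -1.0, 1.0))

# Track the desired output y_d(k) = 0.6 sin(2 pi k / 250) + 0.2 sin(2 pi k / 50).
steps = 250
k = np.arange(steps + 1)
y_d = 0.6 * np.sin(2 * np.pi * k / 250) + 0.2 * np.sin(2 * np.pi * k / 50)
y_act = np.zeros(steps + 1)
for i in range(steps):
    y_act[i + 1] = plant(y_act[i], controller(y_act[i], y_d[i + 1]))
tracking_error = y_d[1:] - y_act[1:]
```

Note that the controller is queried with the pair [y(k), y_d(k + 1)], which mirrors the way the training pairs were formed.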

Figure 17.8. Performance of ANFIS controller for inverse control. Due to the
small error, it is hard to see the desired $y_d(k)$, which is almost overlaid by the
actual y(k). (MATLAB file: inv_fc.m)

This simple example serves to illustrate the concept of inverse control. Some
remarks regarding the simulation are in order to highlight the strengths and
shortcomings of this method.

• We do not need to know the plant dynamics in advance; identification of them
is embedded in the training of ANFIS to find the inverse model.

• Our simulation was based on off-line learning only. It is possible to turn on
on-line learning to cope with time-varying plant dynamics. In general, the
best approach is to use off-line learning to find a working controller for a
nominal plant, and then use on-line learning to fine-tune the controller if the
plant is time varying.

• We assume that at time step k, $y_d(k+1)$ is available and it is used as an input
to the ANFIS controller. If $y_d(k+1)$ is not available until time step k + 1,
then we can use $y_d(k)$ as an input to the ANFIS controller at time step k, and
the resulting overall system will behave like a unit-delay system.

• Before using inverse learning, one should make sure the system to be controlled
by this technique has a unique inverse. This becomes harder to verify as the
order of the plant dynamics increases.

• The distribution of the training data could also pose a problem for this 
method. Ideally, we would like to see the training data distributed across 
the input space of the controller in a somewhat uniform manner. However, 
this may not be possible due either to the scarcity of the data (especially 
when there are many inputs) or to the limits imposed by the underlying plant 
dynamics. In our simulation, the lack of data at the upper right and lower 
right corners of the input space [see Figure 17.7(a)] is primarily the result of 
the underlying plant dynamics. This causes a sharp ascent and descent at 
each of the corners, as can be seen clearly in Figure 17.7(b). 

17.5 SPECIALIZED LEARNING 

A major problem with inverse learning is that an inverse model does not always
exist for a given plant. Moreover, inverse learning is an indirect approach that tries
to minimize the network output error instead of the overall system error (defined as
the difference between desired and actual trajectories). Specialized learning [23]
is an alternative method that tries to minimize the system error directly by
backpropagating error signals through the plant block. The price is that we need to
know more about the plant under consideration.

Figure 17.9. (a) Plant block; (b) specialized learning using desired trajectory.

Figure 17.9 illustrates the most basic type of specialized learning: Figure 17.9(a)
is the plant block (assuming its order is 1), and Figure 17.9(b) indicates the training
of the ANFIS controller. The ANFIS parameters are updated to reduce the system
error $\mathbf{e}_x(k)$, which is defined as the difference between the system’s output
$\mathbf{x}(k)$ and the desired output $\mathbf{x}_d(k)$.

To be more specific, let the plant dynamics be specified by

$$\mathbf{x}(k+1) = \mathbf{f}(\mathbf{x}(k), v(k))$$

and the ANFIS output be denoted as

$$v(k) = F(\mathbf{x}(k), u(k), \boldsymbol{\theta}), \tag{17.17}$$

where $\boldsymbol{\theta}$ is a parameter vector to be updated. (Without loss of generality, we assume
the plant has a single scalar input v(k).) If we set the ANFIS output as the plant’s
input, we have a closed-loop system specified by

$$\mathbf{x}(k+1) = \mathbf{f}(\mathbf{x}(k), F(\mathbf{x}(k), u(k), \boldsymbol{\theta})).$$

The objective of specialized learning is to minimize the difference between the
closed-loop system and the desired model. Hence we can define an error measure:

$$J(\boldsymbol{\theta}) = \sum_k \|\mathbf{f}(\mathbf{x}(k), F(\mathbf{x}(k), u(k), \boldsymbol{\theta})) - \mathbf{x}_d(k+1)\|^2. \tag{17.18}$$

Figure 17.10. (a) Desired model block; (b) specialized learning with model
referencing.


As usual, we rely on backpropagation or steepest descent to update $\boldsymbol{\theta}$ to
minimize the above error measure. To find the derivative of $J(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$,
we need to know the derivative of $\mathbf{f}(\cdot, \cdot)$ with respect to its second argument. In other
words, to backpropagate error signals through the plant block in Figure 17.9(b),
we need to know the Jacobian matrix of the plant, where the element at row i
and column j is equal to the derivative of the plant’s ith output with respect to its
jth input. This usually implies that we need a model for the plant and the Jacobian
matrix obtained from the model, which could be a neural network, an ANFIS, or
another appropriate mathematical description of the plant.

For a single-input plant, if the Jacobian matrix is not easily found directly, a 
crude estimate can be obtained by approximating it directly from the changes in the 
plant’s input and output(s) during two consecutive time instants. Other methods 
that aim at using an approximate Jacobian matrix to achieve the same learning 
effects can be found in refs. [4, 10, 27]. 
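
The crude estimate mentioned above can be written as a one-line finite difference. The following is only an illustrative sketch: the function name `estimate_jacobian`, the `fallback` value returned when the input did not change, and the threshold `eps` are all assumptions introduced here. The estimate ignores the fact that the state also changes between the two instants, which is exactly why it is crude.

```python
import numpy as np

def estimate_jacobian(x_prev, x_curr, v_prev, v_curr, fallback=0.0, eps=1e-8):
    """Approximate d(output)/d(input) of a single-input plant by
    (change in outputs) / (change in input) over two consecutive instants."""
    x_prev = np.asarray(x_prev, dtype=float)
    x_curr = np.asarray(x_curr, dtype=float)
    dv = v_curr - v_prev
    if abs(dv) < eps:
        # The input did not change, so the ratio is undefined.
        return np.full_like(x_curr, fallback)
    return (x_curr - x_prev) / dv

# Usage: for a memoryless plant x = 3 v, inputs 1.0 and 1.5 give outputs 3.0 and 4.5.
jac = estimate_jacobian([3.0], [4.5], 1.0, 1.5)
```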

It is not always convenient to specify the desired plant output $\mathbf{x}_d(k)$ at every
time instant k. As a standard approach in model reference adaptive control,
the desired behavior of the overall system can be implicitly specified by a (usually
linear) model that is able to achieve the control goal satisfactorily. This is shown
in Figure 17.10(b), where the desired output $\mathbf{x}_d(k+1)$ is generated via the desired
model depicted in Figure 17.10(a). Let the desired model be specified by

$$\mathbf{x}(k+1) = \bar{\mathbf{f}}(\mathbf{x}(k), u(k)),$$

where the overbar distinguishes the desired model from the plant $\mathbf{f}$.
Then the error measure in Equation (17.18) becomes

$$\begin{aligned}
J(\boldsymbol{\theta}) &= \sum_k \|\mathbf{f}(\mathbf{x}(k), v(k)) - \mathbf{x}_d(k+1)\|^2 \\
&= \sum_k \|\mathbf{f}(\mathbf{x}(k), F(\mathbf{x}(k), u(k), \boldsymbol{\theta})) - \bar{\mathbf{f}}(\mathbf{x}(k), u(k))\|^2.
\end{aligned} \tag{17.19}$$


Again, we still need the Jacobian matrix of the plant to do backpropagation. 

Under certain circumstances, we do not need the Jacobian matrix of the plant
to carry out backpropagation. Suppose that we are only interested in one element
x(k) of the output vector $\mathbf{x}(k)$. If the dynamics of x(k) are feedback linearizable,
they can be expressed as

$$x(k+1) = f(\mathbf{x}(k), v(k)) = g(\mathbf{x}(k)) + h(\mathbf{x}(k))\,v(k). \tag{17.20}$$


If both $g(\cdot)$ and $h(\cdot)$ are known perfectly, the desired model, denoted by
$x_d(k+1) = \bar{f}(\mathbf{x}_d(k), u(k))$, can be achieved by setting the input v(k) in
Equation (17.20) as follows:

$$v(k) = \frac{\bar{f}(\mathbf{x}(k), u(k)) - g(\mathbf{x}(k))}{h(\mathbf{x}(k))}.$$


If $g(\cdot)$ is unknown and $h(\cdot)$ is known, we can use an ANFIS to approximate $g(\cdot)$
directly; the desired input-output pair for ANFIS training is

$$[\mathbf{x}(k);\; x_d(k+1) - h(\mathbf{x}(k))\,u(k)].$$

Similarly, if $h(\cdot)$ is unknown and $g(\cdot)$ is known, we can use an ANFIS to approximate
$h(\cdot)$ directly; the desired input-output pair for ANFIS training is

$$[\mathbf{x}(k);\; (x_d(k+1) - g(\mathbf{x}(k)))/u(k)].$$

In either of these two situations, the training of ANFIS does not require the
Jacobian matrix of the plant. Also, to make the training data as rich as possible, the
input signal u(k) is preferably a random signal.
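
The case of unknown g and known h can be illustrated with a scalar toy plant. Everything specific in the sketch below is an assumption made for illustration: the functions `g_true` and `h`, the desired model `f_bar`, and the use of a degree-7 polynomial fit as a stand-in for ANFIS. The plant is driven by a random input, the targets are formed as next state minus h times input, and the fitted approximation is then used in the linearizing control law.

```python
import numpy as np

def g_true(x):
    # Unknown to the designer; used here only to simulate the plant.
    return 0.8 * np.sin(x)

def h(x):
    # Assumed known and bounded away from zero.
    return 1.0 + 0.5 * np.cos(x)

def plant(x, v):
    # Scalar plant of the form x(k+1) = g(x(k)) + h(x(k)) v(k).
    return g_true(x) + h(x) * v

# Drive the plant with a random input to collect rich training data.
rng = np.random.default_rng(1)
n = 500
x = np.zeros(n + 1)
u_rand = rng.uniform(-1.0, 1.0, size=n)
for k in range(n):
    x[k + 1] = plant(x[k], u_rand[k])

# Training pairs [x(k); x(k+1) - h(x(k)) u(k)]; a polynomial replaces ANFIS.
targets = x[1:] - h(x[:-1]) * u_rand
g_hat = np.poly1d(np.polyfit(x[:-1], targets, deg=7))

def f_bar(x, u):
    # Hypothetical desired (linear) model.
    return 0.5 * x + 0.5 * u

def control(x, u):
    # Linearizing control law with g replaced by its approximation.
    return (f_bar(x, u) - g_hat(x)) / h(x)

# Closed loop with a constant command u = 1; the desired model settles at x = 1.
x_cl = 0.0
for _ in range(50):
    x_cl = plant(x_cl, control(x_cl, 1.0))
```

No Jacobian of the plant is needed at any point; the quality of the closed loop depends only on how well the approximation matches g over the region visited.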

Note that the ANFIS controller in Equation (17.17) represents the most general
situation. More commonly, the ANFIS controller is a function of $\mathbf{x}(k)$ and $\boldsymbol{\theta}$ only,
and the input to the plant v(k) is expressed as the difference between the command
signal u(k) and the ANFIS output, as follows:

$$v(k) = u(k) - F(\mathbf{x}(k), \boldsymbol{\theta}).$$

Figure 17.11. A trajectory adaptive network for control application (FC stands
for “fuzzy controller”). 

17.6 BACKPROPAGATION THROUGH TIME AND REAL-TIME 
RECURRENT LEARNING 

17.6.1 Fundamentals 

If we replace the controller and the plant block in Figure 17.1 with two adaptive 
networks, the feedback control system becomes the recurrent adaptive network dis- 
cussed in Section 8.4. Assuming that synchronous operation is adopted here (which 
virtually converts the system into a discrete-time domain), we can apply the con- 
cept of unfolding of time introduced in Chapter 8 to obtain a feedforward network, 
and then use the same backpropagation learning algorithm to identify a set of pa- 
rameters to generate satisfactory trajectories. 

To obtain the state trajectory, we cascade the block diagram (or adaptive net- 
work, if both the controller and the plant are replaced with appropriate adaptive 
networks) in Figure 17.3 to obtain the trajectory adaptive network shown in 
Figure 17.11. In particular, the inputs to the trajectory adaptive network are the 
initial conditions of the plant; the outputs are the state trajectories from k = 1 
to k = m. The adjustable parameters all pertain to the FC (fuzzy controller) 
block implemented as an ANFIS. Although there are m FC blocks, all of them refer 
to the same parameter set. For clarity, this parameter set is shown explicitly in 
Figure 17.11 and is updated according to the output of the error measure block. 

Each entry in the training data set for the trajectory network is of the following 
format: 


(initial condition; desired trajectory), 

and the corresponding error measure to be minimized is

$$E = \sum_{k=1}^{m} \|\mathbf{x}(k) - \mathbf{x}_d(k)\|^2,$$

where $\mathbf{x}_d(k)$ is a desired state vector at time step k (which corresponds to real time
kT; T is the sampling period). If we take control efforts into consideration, a revised
error measure would be

$$E = \sum_{k=1}^{m} \|\mathbf{x}(k) - \mathbf{x}_d(k)\|^2 + \lambda \sum_{k=0}^{m-1} \|\mathbf{u}(k)\|^2,$$

where $\mathbf{u}(k)$ is the control action at time step k. By proper selection of $\lambda$, a
compromise between trajectory error and control efforts can be obtained.

Note that backpropagation through time (BPTT) is usually an off-line learning 
algorithm in the sense that the parameters will not be updated until the sequence 
(k = 1 to m) is completed. If the sequence is too long, or if we want to update the 
parameters in the middle of the sequence, we can always apply real-time recurrent 
learning (RTRL), as introduced in Chapter 8. 
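
To make the unfolding concrete, the sketch below applies BPTT to a deliberately simple case. The scalar linear plant x(k+1) = a x(k) + b u(k), the one-parameter controller u(k) = theta x(k), and the zero desired trajectory are assumptions chosen for brevity; they are not the ANFIS controller or the plant of this chapter. The forward pass stores the trajectory, and the backward pass propagates the derivative of the revised error measure from stage m back to stage 0, accumulating the gradient for the single parameter shared by all stages.

```python
import numpy as np

def rollout(theta, x0, m, a, b):
    # Forward pass through m stages that all share the parameter theta.
    x = np.empty(m + 1)
    u = np.empty(m)
    x[0] = x0
    for k in range(m):
        u[k] = theta * x[k]
        x[k + 1] = a * x[k] + b * u[k]
    return x, u

def error_measure(theta, x0, m, a, b, lam):
    # E = sum_{k=1..m} x(k)^2 + lam * sum_{k=0..m-1} u(k)^2 (desired trajectory is zero).
    x, u = rollout(theta, x0, m, a, b)
    return float(np.sum(x[1:] ** 2) + lam * np.sum(u ** 2))

def bptt_gradient(theta, x0, m, a, b, lam):
    # Backward pass: delta holds dE/dx(k+1) when stage k is processed.
    x, u = rollout(theta, x0, m, a, b)
    delta = 2.0 * x[m]
    grad = 0.0
    for k in range(m - 1, -1, -1):
        d_u = 2.0 * lam * u[k] + b * delta       # total dE/du(k)
        grad += d_u * x[k]                       # u(k) = theta * x(k)
        direct = 2.0 * x[k] if k >= 1 else 0.0   # x(0) is not part of E
        delta = direct + a * delta + theta * d_u
    return grad

# Usage: compare with a central finite difference.
args = dict(x0=1.0, m=20, a=1.1, b=0.5, lam=0.1)
grad_bptt = bptt_gradient(-0.3, **args)
step = 1e-6
grad_fd = (error_measure(-0.3 + step, **args) - error_measure(-0.3 - step, **args)) / (2 * step)
```

The parameter is updated only after the whole backward sweep is complete, which is the off-line character of BPTT noted above.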

Use of BPTT to train a neural network to back up a tractor-trailer system was 
reported in ref. [19]. The same technique was used to design an ANFIS controller 
for balancing an inverted pendulum [8]; this is discussed thoroughly next. 

17.6.2 Case Studies: the Inverted Pendulum System 

In this section, we present a detailed description of our simulation that used BPTT 
to find a fuzzy controller for the inverted pendulum system, as reported in ref. [8]. 
For simplicity, we consider the pole dynamics only. Thus, the plant has only two
state variables, $x_1$ and $x_2$, representing, respectively, the pole angle and angular
velocity. The differential equation for the pole dynamics is provided in
Equation (17.3). Note that this is a feedback linearizable system and there exist other
advanced nonlinear control design methods for it (e.g., sliding mode control; see
Section 18.5). Here we shall use the pole system as a simple application example of
BPTT, which does not exploit the feedback linearizability of the system.

To use BPTT, the first thing we have to do is system identification (see
Chapter 5); that is, find an adaptive network representation of the plant block in
Figure 17.3. In fact, we can choose whichever function approximator best
represents the input-output behavior of the plant, as long as the chosen approximator
is piecewise differentiable. Potential candidates for the plant approximator include
conventional linear or nonlinear difference equations and unconventional network
structures (neural networks [24], radial basis function networks [18], GMDH (group
method of data handling) structures [7], functional-link networks [11, 20], ANFIS,
and so on). This model-insensitive attribute is mostly due to the extreme
flexibility of adaptive networks, which allows implementation of all kinds of piecewise
differentiable functions.

Figure 17.12. Network implementation of the discrete control block diagram in
Figure 17.3.


If the plant can be modeled as a set of n (= number of state variables) first-order
difference equations, then the plant block can be replaced with n nodes, each
of which uses one difference equation to obtain the state variable at the next time
step. For simplicity, we assume here that the plant is represented by the difference
equations

$$\begin{cases}
x_1(k+1) = h\,\dot{x}_1(k) + x_1(k), \\
x_2(k+1) = h\,\dot{x}_2(k) + x_2(k),
\end{cases} \tag{17.21}$$

where $\dot{x}_1$ and $\dot{x}_2$ are specified in Equation (17.5). These two equations are the node
functions of the plant block in Figure 17.12.
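
In code, each plant node amounts to one forward-Euler update. The sketch below assumes the form x_i(k+1) = h * (dx_i/dt)(k) + x_i(k) for these difference equations; the derivative function `toy_pole` is a hypothetical placeholder, not the pole dynamics of Equation (17.5).

```python
import numpy as np

def euler_step(state, force, derivatives, h):
    # One stage of the plant block: x(k+1) = h * x_dot(k) + x(k).
    return state + h * derivatives(state, force)

def toy_pole(state, force):
    # Hypothetical stand-in for the state derivatives (angle, angular velocity).
    theta, omega = state
    return np.array([omega, 9.8 * np.sin(theta) + force])

# Usage: advance the state by one step of h = 0.01 with zero force.
next_state = euler_step(np.array([0.1, 0.0]), 0.0, toy_pole, h=0.01)
```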

The controller block in Figure 17.11 is implemented as an ANFIS with two 
inputs, each of which is assigned two membership functions, so it is a fuzzy controller 
with four fuzzy if-then rules of Sugeno’s type [26]. (See the controller block in 
Figure 17.12.) 

As a result of replacing blocks with adaptive networks, the block diagram in 
Figure 17.3 becomes an adaptive network containing two subnetworks, the FC block 
(ANFIS) and the plant block. This adaptive network is referred to as a stage 
adaptive network at time stage k. The trajectory adaptive network shown in 
Figure 17.11 contains m replicas of stage adaptive networks at different time steps. 

For the controller block, we assume that no domain knowledge (from a human 
expert) about the inverted pendulum system is available. Without any domain 
knowledge, we have to set the initial parameters for the ANFIS controller in a 
general and unbiased way. The consequent parameters are all set at zero, which 
means the control action is zero initially, as shown in Figure 17.14(a). The premise
parameters are set in such a way that the membership functions cover the
domain intervals (or universe of discourse) completely, with sufficient overlap
between neighbors. Figures 17.13(a) and 17.13(b) illustrate the generalized bell
membership functions before training; the domain intervals for $\theta$ (degrees) and $\dot{\theta}$
(deg/s) are assumed to be [-20, 20] and [-50, 50], respectively.

We employ 100 stage adaptive networks to construct the trajectory adaptive
network, and each stage adaptive network corresponds to a time transition of 10
milliseconds. That is, the sampling period T used is 10 ms, and the trajectory
adaptive network corresponds to a time interval from t = 0 to t = 1 s. If T is too
small, a large network has to be built to cover the same time span, which increases
the signal propagation time and thus delays the whole learning process. On the
other hand, if T is too big, then the linear approximation of the plant’s behavior
may not be good enough, requiring that a more precise difference equation for the
plant be used.

Figure 17.13. (a)(b) Initial membership functions; (c)(d) final membership
functions.

The training data set contains desired input-output pairs of the format 

(initial condition; desired trajectory), (17.22)

where the initial condition is a two-element vector that specifies the initial condition 
of the pole; the desired trajectory is a 100-element vector that contains the desired 
pole angle at each time step. In our simulation, only two training data entries 
are used: the initial conditions are (10, 0) and (—10, 0), respectively, and the 
desired trajectory is always a zero vector of size 100. In other words, we expect the
controller to be able to bring the pole back to the upright position starting from
either +10 or -10 degrees. The error measure used here is

$$E = \sum_{k=1}^{100} \theta^2(0.01k) + \lambda \sum_{k=0}^{99} f^2(0.01k), \tag{17.23}$$

where f(0.01k) is the controller’s output force and $\lambda$ (= 10) accounts for the relative
unit cost of the control effort.

Figure 17.14. Control action surfaces: (a) before training; (b) after training.

To speed up convergence, we follow a strict steepest descent in the sense that 
each parameter update leads to a smaller error measure. If the error measure 
increases after a parameter update, we back up to the original point in the parameter 
space and decrease the current step size by half. This process is repeated until a 
parameter update leads to a smaller error measure. However, this step size update 
procedure tends to use a small step size if the error measure surface encountered 
in the first few updates is smooth. We therefore multiply the step size by 4 after 
observing three consecutive updates without any backing-up action. The initial step 
size in the simulation is 20, and the learning process stops whenever the number of 
successful parameter updates (which is equal to the number of reductions in error 
measure) reaches 10. We experimented with the initial step size and found that
if the initial step size was too small, the training process converged prematurely to
a set of parameters, presumably a local minimum, that does not really minimize the
error measure.
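
The step-size procedure just described can be sketched as follows. This is an illustration under assumptions rather than the code used in the simulation: the update direction is taken to be the normalized negative gradient, the cap `max_trials` is added so the loop always terminates, and the quadratic test function is hypothetical.

```python
import numpy as np

def strict_descent(error, grad, theta0, step=20.0, max_success=10, max_trials=1000):
    theta = np.asarray(theta0, dtype=float)
    best = error(theta)
    history = [best]          # error after each successful update
    streak = 0                # consecutive updates without backing up
    trials = 0
    while len(history) - 1 < max_success and trials < max_trials:
        trials += 1
        g = grad(theta)
        norm = np.linalg.norm(g)
        if norm == 0.0:
            break             # stationary point: nothing left to do
        candidate = theta - step * g / norm
        e = error(candidate)
        if e < best:
            theta, best = candidate, e
            history.append(best)
            streak += 1
            if streak == 3:   # three clean updates in a row: enlarge the step
                step *= 4.0
                streak = 0
        else:
            step *= 0.5       # back up to the old point and halve the step
            streak = 0
    return theta, history, step

def quadratic(theta):
    return float((theta[0] - 1.0) ** 2 + 10.0 * (theta[1] + 2.0) ** 2)

def quadratic_grad(theta):
    return np.array([2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 2.0)])

theta_final, history, final_step = strict_descent(quadratic, quadratic_grad, [5.0, 5.0])
```

Because a rejected update leaves the parameters unchanged, the recorded error sequence is strictly decreasing by construction.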

All the aforementioned simulation settings are referred to as the reference
setting; other simulations are based on this setting with minor changes. In the learning
task with the reference setting, the FC balances the pole right after the first
parameter update, and learning keeps refining the controller (minimizing the
error measure) until the 10th successful parameter update. Figures 17.13(a) and
17.13(b) show the initial membership functions for pole angle and angular
velocity; Figures 17.13(c) and 17.13(d) show the final membership functions.
If $\theta$ is in degrees and $\dot{\theta}$ is in deg/s, the initial fuzzy if-then rules are


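$$\begin{cases}
\text{if } \theta \text{ is } A_1 \text{ and } \dot{\theta} \text{ is } B_1, \text{ then force} = 0; \\
\text{if } \theta \text{ is } A_1 \text{ and } \dot{\theta} \text{ is } B_2, \text{ then force} = 0; \\
\text{if } \theta \text{ is } A_2 \text{ and } \dot{\theta} \text{ is } B_1, \text{ then force} = 0; \\
\text{if } \theta \text{ is } A_2 \text{ and } \dot{\theta} \text{ is } B_2, \text{ then force} = 0;
\end{cases} \tag{17.24}$$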

where $A_1$, $A_2$, $B_1$, and $B_2$ are the linguistic labels characterized by the generalized
bell MF parameters (20, 2, -20), (20, 2, 20), (50, 2, -50), and (50, 2, 50),
respectively. Figure 17.14(a) is the initial control action surface.

Figure 17.15. (a)(b) Initial membership functions and (c)(d) final membership
functions of a nine-rule fuzzy controller.

The final fuzzy if-then rules derived from the reference settings are as follows: 

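$$\begin{cases}
\text{if } \theta \text{ is } A_1 \text{ and } \dot{\theta} \text{ is } B_1, \text{ then force} = 0.0502\,\theta + 0.1646\,\dot{\theta} - 10.09; \\
\text{if } \theta \text{ is } A_1 \text{ and } \dot{\theta} \text{ is } B_2, \text{ then force} = 0.0083\,\theta + 0.0119\,\dot{\theta} - 1.09; \\
\text{if } \theta \text{ is } A_2 \text{ and } \dot{\theta} \text{ is } B_1, \text{ then force} = 0.0083\,\theta + 0.0119\,\dot{\theta} + 1.09; \\
\text{if } \theta \text{ is } A_2 \text{ and } \dot{\theta} \text{ is } B_2, \text{ then force} = 0.0502\,\theta + 0.1646\,\dot{\theta} + 10.09;
\end{cases}$$

where $A_1$, $A_2$, $B_1$, and $B_2$ are the linguistic labels characterized by the generalized
bell MF parameters (-1.59, 2.34, -19.49), (-1.59, 2.34, 19.49), (85.51, 1.94, -23.21),
and (85.51, 1.94, 23.21), respectively. Figure 17.14(b) is the final control action
surface.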
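
The final rule base is small enough to evaluate directly. The sketch below assumes the usual first-order Sugeno inference (product of membership grades as firing strength, weighted average of the rule outputs) and the generalized bell function 1 / (1 + |(x - c)/a|^(2b)) with parameters ordered (a, b, c); the parameter values are those listed for the final rules, while the function names are illustrative.

```python
import numpy as np

def gbell(x, a, b, c):
    # Generalized bell membership function with parameters (a, b, c).
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2.0 * b))

# Premise parameters for the pole angle (A1, A2) and angular velocity (B1, B2).
A = [(-1.59, 2.34, -19.49), (-1.59, 2.34, 19.49)]
B = [(85.51, 1.94, -23.21), (85.51, 1.94, 23.21)]

# Consequents (p, q, r) with force = p*theta + q*theta_dot + r,
# ordered as (A1, B1), (A1, B2), (A2, B1), (A2, B2).
CONSEQUENTS = [(0.0502, 0.1646, -10.09),
               (0.0083, 0.0119, -1.09),
               (0.0083, 0.0119, 1.09),
               (0.0502, 0.1646, 10.09)]

def force(theta, theta_dot):
    # Firing strengths: product of the two membership grades for each rule.
    w = np.array([gbell(theta, *A[i]) * gbell(theta_dot, *B[j])
                  for i in range(2) for j in range(2)])
    # Rule outputs: linear functions of the two inputs.
    f = np.array([p * theta + q * theta_dot + r for p, q, r in CONSEQUENTS])
    # Overall output: weighted average of the rule outputs.
    return float(np.dot(w, f) / np.sum(w))
```

Because the membership functions and consequents are mirror images of each other, this rule base produces zero force at the upright equilibrium and an odd-symmetric control surface, so that `force(-t, -w)` equals `-force(t, w)`.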

Figure 17.13 indicates that the final membership functions for $\theta$ are quite
different from the initial membership functions. Note that there appear to be no
membership functions covering the interval [-25, 25] of $\theta$, making linguistic
interpretation of the fuzzy rules difficult. (In fact, MF grades in the interval [-25, 25]
are never zero; they are just too small to be noticeable in the plot.) However,
since we are utilizing ANFIS as a function approximator that can generate a
required nonlinear mapping, linguistically desirable features (such as enough overlap
between neighboring membership functions and total coverage of the whole input
domain) do not have to be one of the fuzzy controller’s attributes in this case. If we
want to keep these desirable features, we can either impose some constraints on the
premise parameters or simply increase the number of MFs to give the neuro-fuzzy
controller more degrees of freedom. Figure 17.15 shows the membership functions
of a nine-rule fuzzy controller that has about the same performance as the four-rule
fuzzy controller. Due to its greater number of degrees of freedom, the premise
parameters of the nine-rule fuzzy controller do not have to change a lot to minimize
the error measure; therefore, the final membership functions clearly cover all the
domain intervals with desirable overlapping.

Figure 17.16. (a) Pole angle; (b) pole angular velocity; (c) state space; (d) input
force. [Solid, dashed, and dotted curves correspond to $\lambda$ = 10, 40, and 100,
respectively.]

Solid curves in Figure 17.16 demonstrate the state variable trajectories at the
reference setting: (a), (b), and (d) show the pole angle (degrees), angular velocity
(deg/s), and control actions (N) from t = 0 to t = 2 s; (c) is the state-space plot
that shows how the trajectory approaches the origin from the initial point (10, 0).
Dashed and dotted curves in Figure 17.16 correspond to $\lambda$ equal to 40 and 100,
respectively. From Figure 17.16(a), it can be seen that a smaller $\lambda$ (solid curve)
achieves the control goal faster since the controller can apply a larger force to balance
the pole. For a large $\lambda$ (dotted curve), the controller’s output has to be kept small,
thus slowing down the approach to the goal.

To demonstrate how the fuzzy controller can survive substantial changes in plant
parameters, we used poles of different lengths to test the controller obtained from the
reference setting. The results are shown in Figure 17.17, where solid, dashed, and
dotted curves correspond to half-lengths of the pole equal to 0.5 (reference setting),
0.25, and 0.125 m, respectively. The controller obtained from the reference setting
can handle the shorter poles easily and gracefully.

Figure 17.17. (a) Pole angle; (b) pole angular velocity; (c) state space; (d) input
force. [Solid, dashed, and dotted curves correspond to half pole lengths of 0.5, 0.25,
and 0.125 m, respectively.]

In the learning phase, we supply only two training data entries, corresponding to
initial conditions (10, 0) and (-10, 0) of the pole. It would be interesting to know
how the FC (obtained from the reference setting) deals with other initial
conditions. So we monitor the pole behavior starting from other initial conditions that
make the control goal even harder. Figure 17.18 shows the results; the solid,
dashed, and dotted curves correspond to the initial conditions (10, 20), (15, 30),
and (20, 40), respectively. Again, the same fuzzy controller can perform the control
task starting from initial conditions not used for training. Figure 17.18
and Figure 17.17 reveal the robustness and fault tolerance of the fuzzy controller
obtained via backpropagation through time.

Figure 17.18. (a) Pole angle; (b) pole angular velocity; (c) state space; (d) input
force. [Solid, dashed, and dotted curves correspond to initial conditions (10, 20),
(15, 30), and (20, 40), respectively.]


17.7 SUMMARY

This chapter presents five design techniques for neuro-fuzzy controllers: expert
control, inverse learning, specialized learning, backpropagation through time, and
real-time recurrent learning. Most of these techniques are derived directly from
the neural control literature. Other design techniques that are not closely coupled with
neural networks are described in the next chapter.


EXERCISES 

1. In Example 17.2, if the control signal in the state equation in Equation (17.11) is 
a vector of size m and matrix B is n x m, how do you modify the controllability 
(and thus invertibility) condition? 

2. Assume that the input to the state equation in Equation (17.7) is a vector of
size m. How does the multi-input plant change the learning and application 
stages discussed in Section 17.4? 

Chapter 18


Neuro-Fuzzy Control II 


18.1 INTRODUCTION 

In the previous chapter, we introduced neuro-fuzzy control and some design ap- 
proaches that use neuron-like learning directly; these include expert control, inverse 
learning, specialized learning, backpropagation through time, and real-time recur- 
rent learning. 

In this chapter, we shall introduce more design methods that do not rely to- 
tally on neural-like learning. Instead, some of them employ derivative-free opti- 
mization (see Chapter 7) or reinforcement learning techniques (see Chapter 10); 
others employ conventional control techniques (such as adaptive control, feedback 
linearization, sliding mode control, gain scheduling, and so on) to accomplish their 
tasks. 

18.2 REINFORCEMENT LEARNING CONTROL 

Reinforcement learning plays an important role in the adaptive control field. 
It surely helps, especially when no explicit teacher signal is available in the envi- 
ronment (or world) where an interacting agent must learn to perform an optimal 
control action. The world informs the agent of a reinforcement signal associated 
with the performed control action and of the resulting new state (see Figure 18.1). 

This section provides a brief description of reinforcement learning control and 
neuro-fuzzy reinforcement control systems. (For a more thorough discussion of
general aspects of reinforcement learning, refer to Chapter 10.)

18.2.1 Control Environment 

A basic problem in feedback control is that of determining an appropriate control
action at each time instant to optimize a long-term objective. For a control goal
explicitly defined as an objective function, this is achievable by using any of the
supervised learning methods. However, such an explicit objective function is not
always available; occasionally the only information about the agent’s performance is
a scalar score (usually called reinforcement) indicating how good the current action
is, or even just a binary signal indicating whether the action is right or wrong. The
reinforcement signal is obviously a low-quality feedback signal. In other words, it
is evaluative rather than instructive. (The learning agent receives either reward
or punishment according to such evaluations.) Furthermore, the signal is often
delivered infrequently and delayed: it is not available at each time instant, and
when it is available at a certain moment, it represents the results of a series of
control actions probably performed over a lengthy period of time. Note that a
plant model is not necessarily required for this type of learning; the price is that
achieving a given control objective often takes a longer learning time.

Figure 18.1. An interactive learning agent.

In the reinforcement learning control literature, the pole-balancing control prob- 
lem has been widely explored by many researchers [1, 2, 4, 16, 20, 23, 29]. (See also 
Section 10.5.1 in Chapter 10.) In this control problem, the only training signal avail- 
able is the knowledge that the cart has reached a certain maximum displacement 
or that the pole has reached a maximum angle of deviation. 

18.2.2 Neuro-Fuzzy Reinforcement Controllers 

The basic idea behind fuzzy reinforcement learning is to apply a fuzzy partitioning 
scheme to the continuous state space and to introduce linguistic interpretation. Such 
averaging over neighboring partitioned subspaces can create generalization abilities;
previously unknown states can thus be evaluated, unlike in the Boxes system [20]
discussed in Section 10.5.1 of Chapter 10, which divided the entire
state space into 162 non-overlapping digitized subspaces (boxes).

There are two representative neuro-fuzzy reinforcement learning models: GARIC
(Generalized Approximate Reasoning for Intelligent Control), from Berenji and
Khedkar [3, 4], and RNN-FLCS (Reinforcement Neural-Network-based Fuzzy Logic
Control System), from Lin and Lee [16]. These models basically realize the AHC
(adaptive heuristic critic) idea illustrated in Figure 18.2; the AHC architecture
typically consists of an action (or control) module and a critic (or evaluation)
module. (For more details on AHC, refer to Chapter 10.)

Figure 18.2. A neuro-fuzzy AHC model.

Table 18.1. Comparison of the three AHC models: AHCON, GARIC, and
RNN-FLCS. In this table, neuro means a multilayer perceptron.

    AHC model     Critic module     Action module
    AHCON         Neuro             Neuro
    GARIC         Neuro             Neuro-fuzzy
    RNN-FLCS      Neuro-fuzzy       Neuro-fuzzy

GARIC has three components [4]: the action selection network (ASN), the action
evaluation network (AEN), and the stochastic action modifier (SAM). It has
basically the same architecture as Lin’s AHCON (AHC connectionist) model [17, 18],
except that the ASN is expressed in a neuro-fuzzy framework. Lin and Lee’s
RNN-FLCS consists of a fuzzy controller (action NN) and a fuzzy predictor (value NN);
they viewed the “stochastic action selection” function as part of the action NN.
(Whether to treat the stochastic action selection function as an individual
component is inevitably a matter of individual judgment.) The structural comparison
of these three models is presented in Table 18.1. Several variations have already
been proposed; for instance, the critic module can be replaced by an RBFN [14]
or a CMAC [15, 34, 35]. (Pedrycz discussed a fuzzy controller as a CMAC
component [24].) By contrast, the whole RNN-FLCS is expressed in a neuro-fuzzy
framework; both critic (fuzzy predictor) and action module (fuzzy controller) share 
the antecedent parts of the fuzzy rules. That is, both fuzzy functional modules 
share the same input fuzzy membership functions, just like CANFIS, discussed in 
Chapter 13. In GARIC, on the other hand, the ASN and the AEN do not share 
the antecedent parts because the AEN is expressed as a two-layer feedforward NN 
with sigmoidal functions except in the output layer. The ASN part of GARIC is 
a five-layer neuro-fuzzy model that has almost the same conceptual framework as 
our proposed ANFIS/CANFIS. 

18.3 GRADIENT-FREE OPTIMIZATION 

Most of the learning algorithms for automatic control applications introduced so far
are derivative-based optimization techniques. However, the calculation of gradient 
vectors can be messy if the plant under consideration is complicated or the process 
is lengthy due to sluggish dynamics (see Figure 17.11). If this is the case, then 
derivative-free optimization schemes are preferable alternatives for optimization- 
based control designs. As stated in Chapter 7, four of the most popular optimization 
methods of this kind are genetic algorithms [7, 8], simulated annealing [11], the 
random optimization method [19], and the downhill Simplex method [21]. To apply 
any of these algorithms to control applications involves the following three steps: 

1. Define a parameterized controller. It could be a linear controller, a neural 
network, a fuzzy controller, etc. (Since we are not going to rely on the gradient, 
the controller could even be non-differentiable with respect to its parameters.) 

2. Define an objective function that relates to the control goal. Usually we 
minimize the objective function to achieve the control goal. In other words, 
the smaller the objective function, the better the control performance. 

3. Evaluate the objective function (usually by simulation) and update the controller’s
parameters by any of the derivative-free optimization methods; repeat until 
the objective function is below a given value or the computing time exceeds 
a specified upper bound. 

In step 3, we need a plant model to evaluate the objective function, and the plant
model is assumed to be correct throughout the entire design process. So basically 
this is an off-line design method and it relies directly on the correctness of the plant 
model. 
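The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the plant is a hypothetical double integrator, the controller is a two-gain linear state feedback, and the optimizer is a simplified random search that keeps a trial point only when it lowers the objective (it is not the exact random optimization method of ref. [19]). All names and numerical values are ours.

```python
import random

def simulate_cost(gains, steps=200, dt=0.05):
    """Step 2: objective J obtained by simulating an assumed plant model
    (double integrator) under state feedback u = -k1*pos - k2*vel."""
    k1, k2 = gains
    pos, vel, cost = 1.0, 0.0, 0.0
    for _ in range(steps):
        u = -k1 * pos - k2 * vel
        vel += u * dt
        pos += vel * dt
        cost += (pos * pos + 0.01 * u * u) * dt
        if abs(pos) > 1e6:            # diverged: return a large penalty
            return 1e12
    return cost

def random_search(cost_fn, start, sigma=0.3, iters=300, seed=0):
    """Step 3: derivative-free update; only the value of J is used."""
    rng = random.Random(seed)
    best, best_cost = list(start), cost_fn(start)
    for _ in range(iters):
        trial = [p + rng.gauss(0.0, sigma) for p in best]
        trial_cost = cost_fn(trial)
        if trial_cost < best_cost:    # keep the trial only if J decreases
            best, best_cost = trial, trial_cost
    return best, best_cost

initial = [0.5, 0.5]                  # step 1: parameterized controller
tuned, tuned_cost = random_search(simulate_cost, initial)
print(tuned, tuned_cost)
```

Because the search never inspects a gradient, `simulate_cost` could equally wrap a non-differentiable controller; the price is that every candidate costs one full simulation of the plant model.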

Note that the preceding steps also apply to modeling tasks, except that the 
objective function is defined as an error measure that describes the discrepancy 
between desired outputs and model’s output. 

Any of the four derivative-free methods can be used in step 3 of the preceding 
design procedure. However, for the rest of this section, we shall focus our discussion 
on genetic algorithms, for the following reasons: 

• Exploitation of GAs for neural or fuzzy control has been around for a while
and thus there is a more extensive body of literature than on the other three 
methods. Examples of using GAs for neural network controllers can be found 
in ref. [36]; for fuzzy logic controllers, see refs. [9, 10, 13]. 

• GAs are parallel in nature; this could be a decisive advantage when used with 
parallel machines. 

• Above all, the application of GAs is not as straightforward as the other 
three methods. The definitions of the coding scheme and genetic operators 
(crossover and mutation) could be trickier than they appear. If these are not 
defined appropriately to match the nature of the problem to be solved, GAs 
behave somewhat like a parallel random search method. 

Moreover, we shall further narrow our scope by discussing GAs for fuzzy control 
only, although similar approaches can be used for neural control as well. Specifi- 
cally, we shall explain how to define coding schemes and genetic operators for fuzzy 
control, and how to embed a priori knowledge into GAs. 

18.3.1 GAs: Coding and Genetic Operators 

Successful use of GAs depends heavily on coding strategies for underlying applica- 
tions; fuzzy control is no exception. The coding scheme for a fuzzy inference system 
(FIS) refers to the way of arranging the parameters of the FIS into a bit-string 
representation (or chromosome) such that the representation preserves certain 
good properties after recombination specified by genetic operators like crossover 
and mutation. 

Figure 18.3 illustrates the hierarchical structure of GAs for fuzzy inference sys- 
tems. The topmost level indicates that each generation contains a number of FISs as 
individuals in a population; the lowest level demonstrates that each parameter is
represented by a unit of eight bits, called a gene, in a long bit-string representation 
of a chromosome. 

The coding scheme in Figure 18.3 is straightforward, but it is too simple to be 
of practical value because we did not pay any attention to the rule structure or 
the relationship between neighboring membership functions. This simple-minded 
coding scheme may result in problems such as the following: 

• The crossover operator produces some random child FISs that do not inherit 
good properties from their parents. 

• The crossover and/or mutation operators produce incomplete or ill-defined
FISs. For instance, there could be a “hole” in the input space that is not 
covered by any membership functions. 

• Structure-level adaptation that changes the number of rules and number of
MFs is not explicitly accounted for.
Figure 18.3. Hierarchical representation of FIS in GA. (Levels, from top to
bottom: FIS population, I/O description, MF description, parameter
representation, bit-string representation (gene).)

• There could be difficulty accommodating a priori knowledge about the target 
system, such as symmetries, regularities, and homogeneities, that are not 
easily encoded in general. Genetic operators would then be likely to rupture 
these good properties in child FISs. 

There are many coding schemes that can deal with these problems. Here we 
give two examples to show the general flavor of a good coding strategy. 

Example 18.1 Coding scheme for orthogonal MFs [22] 

Figure 18.4 demonstrates a coding scheme for membership functions (MFs), where 
the center position of each MF is represented by “1”. Usually the first and the 
last bits of the string representation are “1” to ensure coverage of the boundaries. 
Figure 18.5 illustrates the effects of the crossover and mutation operators. The 
child strings after crossover, as shown in Figure 18.5(a), still preserve the condition 
of orthogonality (that is, the sum of all MF values for a specific input value is 
always equal to unity, see the definition in Chapter 2), thus eliminating the risk of 
accidentally introducing a “hole” in the input domain. The effects of mutation, as 
shown in Figure 18.5(b), are equivalent to adding an MF (when 0 → 1) or deleting
one (when 1 → 0) while keeping orthogonality. Obviously, the genetic operators in
this example change both the shapes and the number of MFs; this amounts to both
parameter- and structure-level adaptation.

□ 
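To make the idea of Example 18.1 concrete, the following Python sketch decodes a bit string into MFs and applies one-point crossover and bit-flip mutation. It rests on assumptions of ours: the bits are spread uniformly over the input range, and each "1" is the peak of a triangular MF whose feet sit on the neighboring centers (one simple way to obtain orthogonality; the shapes in Figure 18.4 may differ). The function names are hypothetical.

```python
def centers_from_bits(bits, lo=0.0, hi=1.0):
    """Map the positions of the '1' bits onto the interval [lo, hi]."""
    step = (hi - lo) / (len(bits) - 1)
    return [lo + i * step for i, b in enumerate(bits) if b == 1]

def memberships(x, centers):
    """Triangular MFs with feet on the neighboring centers, so that the
    grades at any x inside [centers[0], centers[-1]] sum to one."""
    grades = [0.0] * len(centers)
    for i in range(len(centers) - 1):
        left, right = centers[i], centers[i + 1]
        if left <= x <= right:
            w = (x - left) / (right - left)
            grades[i], grades[i + 1] = 1.0 - w, w
            break
    return grades

def one_point_crossover(a, b, point):
    """Swap the tails of two parent strings at the given cut point."""
    return a[:point] + b[point:], b[:point] + a[point:]

def flip(bits, index):
    """Mutation: 0 -> 1 adds an MF, 1 -> 0 deletes one.
    The two end bits are left untouched to keep the boundaries covered."""
    out = list(bits)
    if 0 < index < len(bits) - 1:
        out[index] = 1 - out[index]
    return out

parent_a = [1, 0, 1, 0, 0, 1, 0, 1]
parent_b = [1, 1, 0, 0, 1, 0, 0, 1]
child_a, child_b = one_point_crossover(parent_a, parent_b, 4)
mutant = flip(child_a, 3)

for chromosome in (child_a, child_b, mutant):
    grades = memberships(0.37, centers_from_bits(chromosome))
    print(chromosome, round(sum(grades), 12))
```

Since every decoded string has "1" at both ends and each input value falls between exactly two adjacent centers, no crossover point or interior bit flip can open a hole in the input domain.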
Figure 18.4. Orthogonality-preserved coding.

Figure 18.5. Crossover and mutation for orthogonality-preserved coding.


Example 18.2 Genetic operators for input space partitioning 

Input space partitioning determines the premise part of a fuzzy rule set. For
instance, the tree-style partitioning of Figure 18.6 divides the input space into six
regions, each of which defines the premise part of a rule. This input space
partitioning can be represented as a string $C_1 C_2 \cdots C_{n-1}$, where $n$ is the number of
rules and each $C_i$ encodes the information for a straight-line cut, which includes
the position of the cut and the dimension in which it is executed. This string
becomes a common chromosome when each $C_i$ is replaced by a bit string. Figure 18.7
illustrates one way of defining the crossover and mutation operators. The crossover
operator essentially restricts the crossover points to those points separating the $C_i$s.
Figure 18.6. String representation for input space partitioning.

Figure 18.7. Crossover and mutation for input space partitioning.

The mutation operator shifts the string in a cyclic manner; the amount is a random 
number between 1 and $n - 1$. Note that when there is more than one way to execute
a cut, we always choose the longest way. 


□ 
18.3.2 GAs: Formulating Objective Functions

To apply any optimization method to control design, we need to formulate an 
objective function directly related to the control goal to be achieved. In optimum 
control literature, a general format for the objective function is [5] 

$$J = S(\mathbf{x}(N)) + \sum_{k=0}^{N-1} L(\mathbf{x}(k), \mathbf{u}(k)), \qquad (18.1)$$

where $\mathbf{x}(k)$ and $\mathbf{u}(k)$ are the output and control actions, respectively, at time $k$;
and $N$ is the stop time for the process under consideration.

If we set $S = 0$ and $L = \mathbf{u}^T \mathbf{u}$, then $J = \sum \mathbf{u}^T \mathbf{u}$ is a measure of control
effort (or energy). The minimization of $J$ is called the least-effort problem. If we
set $S = \|\mathbf{x}(N) - \mathbf{x}_d(N)\|^2$ and $L = 0$, then minimizing $J$ is called the minimum
terminal-error problem since $J$ specifies the square of the norm of the error between
the final state $\mathbf{x}(N)$ and a desired final state $\mathbf{x}_d(N)$. In particular, for a linear
system with the following quadratic objective function

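$$J = \mathbf{x}^T(N) \mathbf{M} \mathbf{x}(N) + \sum_{k=0}^{N-1} \left[ \mathbf{x}^T(k) \mathbf{Q} \mathbf{x}(k) + \mathbf{u}^T(k) \mathbf{R} \mathbf{u}(k) \right], \qquad (18.2)$$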

where M and Q are symmetric positive semi-definite matrices and R is a symmetric 
positive-definite matrix, there exists a unique linear controller that solves this finite- 
time regulator problem analytically [5, 12]. 
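As a small illustration of the analytic solution mentioned above, the sketch below runs the standard backward Riccati recursion for the scalar case. The plant model $x(k+1) = a\,x(k) + b\,u(k)$ and all numerical values are assumptions of ours (the text does not specify them); $q$, $r$, and $m$ play the roles of Q, R, and M in Equation (18.2).

```python
def finite_horizon_lqr(a, b, q, r, m, horizon):
    """Backward Riccati recursion for the scalar plant x(k+1) = a*x(k) + b*u(k).
    Returns the time-varying gains K[k] (with u(k) = -K[k]*x(k)) and the
    cost-to-go coefficients P[k], where P[horizon] = m."""
    p = [0.0] * (horizon + 1)
    gains = [0.0] * horizon
    p[horizon] = m
    for k in range(horizon - 1, -1, -1):
        gains[k] = a * b * p[k + 1] / (r + b * b * p[k + 1])
        p[k] = q + a * a * p[k + 1] - a * b * p[k + 1] * gains[k]
    return gains, p

def closed_loop_cost(a, b, q, r, m, gains, x0):
    """Evaluate J of Equation (18.2) by simulating the closed loop."""
    x, cost = x0, 0.0
    for gain in gains:
        u = -gain * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost + m * x * x

a, b, q, r, m, horizon, x0 = 1.1, 0.5, 1.0, 0.1, 2.0, 20, 3.0
gains, p = finite_horizon_lqr(a, b, q, r, m, horizon)
print(closed_loop_cost(a, b, q, r, m, gains, x0), p[0] * x0 * x0)
```

The two printed numbers agree up to rounding: the simulated cost of the linear controller equals the value $P_0 x_0^2$ predicted by the recursion, which is the sense in which the regulator problem is solved analytically rather than by search.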

When using derivative-free optimization methods, we can employ an even more 
complex objective function than Equation (18.1). This means that we can
incorporate structure-level information into the objective function and let the
derivative-free optimization methods do the whole job: finding the best structure, as well as
the optimal parameters; in neural networks, finding the correct number of neurons, 
as well as the proper connection weights; in fuzzy controllers, finding the correct 
number of rules, as well as the proper MF parameters. This seems too good to be 
true. However, keep in mind that derivative-free methods are slow and they could 
take a tremendous amount of time to obtain a less-than-optimal solution. 

Let the original objective function in Equation (18.1) be denoted as J'. Then a 
new objective function that also takes the number of MFs into consideration could 
be 

$$J = J' + \delta \cdot (\text{total number of MFs}), \qquad (18.3)$$

where $\delta$ is a constant that specifies the importance of minimizing the number of
MFs. Usually, more MFs implies more rules, so minimizing J also indirectly reduces 
the number of rules. 
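A quick numerical illustration of Equation (18.3), with made-up values for $J'$ and $\delta$: a slightly worse-performing controller with fewer MFs can come out ahead once the structural penalty is included.

```python
def penalized_objective(j_prime, num_mfs, delta=0.05):
    """Equation (18.3): performance term plus a penalty on the MF count.
    The default delta is an arbitrary illustrative value."""
    return j_prime + delta * num_mfs

# Two hypothetical candidate controllers: (J', total number of MFs).
large = penalized_objective(1.00, 9)   # better raw performance, more MFs
small = penalized_objective(1.02, 5)   # slightly worse J', fewer MFs
print(large, small)
```

With $\delta = 0.05$ the smaller controller wins; with $\delta = 0$ the ranking reverts to raw performance, so $\delta$ directly sets the exchange rate between accuracy and structural simplicity.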

In the same vein, we could define a new objective function as 

_ (settling time) + k\ J' 

(number of rules) -I- k^ 


(18.4) 
Figure 18.8. Schematic diagram for (a) Sugeno fuzzy controller; (b) gain-scheduling
fuzzy controller.


in which $k_1$ and $k_2$ are constants. This objective function tries to reduce both the
settling time and the number of rules. This was used in the fuzzy controller design 
in ref. [13]. 


18.4 GAIN SCHEDULING 
18.4.1 Fundamentals 

A regular first-order Sugeno fuzzy controller uses its inputs both in the premise part 
to determine firing strengths, and in the consequent part to determine each rule’s 
output; this is shown in Figure 18.8(a), where both the premise and consequent 
parts receive the same inputs. For certain applications, it suffices to use only some 
of the inputs for the premise part and the others for the consequent part, as shown 
in Figure 18.8(b). Obviously, Figure 18.8(b) is a special case of 18.8(a), but is very 
useful in designing fuzzy controllers based on the concept of gain scheduling. 

Specifically, the inputs to a gain-scheduling fuzzy controller contain two 
types of variables: scheduling and state. Scheduling variables are used in the premise 
part to determine what mode or characteristics the plant has. Once the plant mode 
or characteristics have been determined, the corresponding rule is fired with an 
output equal to the state variables multiplied by appropriate state feedback gains. 
In terms of Figure 18.8(b), z and u are the scheduling variables and x and y are
the state variables. 

This is best explained by a simple example. For a hypothetical inverted pendu- 
lum system with a varying pole length, a gain-scheduling fuzzy controller may have 
the following fuzzy if-then rules: 

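$$
\begin{cases}
\text{If pole is short, then } f_1 = k_{11}\theta + k_{12}\dot{\theta} + k_{13} z + k_{14}\dot{z}. \\
\text{If pole is medium, then } f_2 = k_{21}\theta + k_{22}\dot{\theta} + k_{23} z + k_{24}\dot{z}. \\
\text{If pole is long, then } f_3 = k_{31}\theta + k_{32}\dot{\theta} + k_{33} z + k_{34}\dot{z}.
\end{cases} \qquad (18.5)
$$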

This is actually a gain-scheduling controller, where the scheduling variable is the
pole length and the state variables are $[\theta, \dot{\theta}, z, \dot{z}]$. Depending on the value of
the scheduling variable, the control action switches smoothly among three sets of 
feedback gains, each of them designed specifically for a certain range of the schedul- 
ing variable. The key feature of this controller that makes it different from other 
gain-scheduling controllers is that the feedback gains are blended smoothly via 
membership functions that give linguistic meanings to the scheduling variable. 

A more detailed description of the design approach is as follows:

1. Determine a set of representative points in the scheduling variable space. 
These points should be distributed more or less uniformly throughout the 
scheduling variable domain. 

2. Construct MFs for the scheduling variables such that each representative point 
fires a rule at maximum strength and the number of rules is equal to the 
number of representative points. 

3. Find the feedback gains at each representative point. This can be achieved by 
any conventional linear control technique, such as pole placement, quadratic 
optimal design, and gain/phase margin methods. The feedback gains thus 
found are used in the output equation of each rule corresponding to a repre- 
sentative point. 
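The design steps above can be sketched as follows. The feedback gains are made-up placeholders (in practice they would come from step 3, for example from a quadratic optimal design at each representative point), and we use triangular MFs peaking at the representative points as a simple stand-in for whatever MF shape is actually chosen. The state ordering $[\theta, \dot{\theta}, z, \dot{z}]$ follows Equation (18.5).

```python
# Step 1: representative points of the scheduling variable (pole length, m).
POINTS = [0.5, 1.0, 1.5]

# Step 3 (placeholder values): one gain vector [k_i1, k_i2, k_i3, k_i4] per rule.
GAINS = [
    [40.0, 8.0, 2.0, 4.0],    # rule 1: pole is short
    [55.0, 14.0, 2.0, 5.0],   # rule 2: pole is medium
    [70.0, 21.0, 2.0, 6.0],   # rule 3: pole is long
]

def firing_strengths(length, points):
    """Step 2: triangular MFs that fire at full strength at their own
    representative point; the end MFs saturate outside the covered range."""
    w = [0.0] * len(points)
    if length <= points[0]:
        w[0] = 1.0
    elif length >= points[-1]:
        w[-1] = 1.0
    else:
        for i in range(len(points) - 1):
            if points[i] <= length <= points[i + 1]:
                frac = (length - points[i]) / (points[i + 1] - points[i])
                w[i], w[i + 1] = 1.0 - frac, frac
                break
    return w

def control(length, state):
    """First-order Sugeno output: weighted average of the rule outputs."""
    w = firing_strengths(length, POINTS)
    rule_outputs = [sum(k * s for k, s in zip(g, state)) for g in GAINS]
    return sum(wi * fi for wi, fi in zip(w, rule_outputs)) / sum(w)

state = [0.1, 0.0, -0.2, 0.0]          # [theta, theta_dot, z, z_dot]
print(control(1.0, state))             # exactly the "medium" rule
print(control(0.75, state))            # blend of "short" and "medium"
```

At a representative point only one rule fires, so the controller reproduces the gains designed there; between points the gains are interpolated smoothly, which is the interpolation behavior described in the text.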

In the preceding approach, we assumed that the number of representative points 
was small, so we could construct fuzzy rules directly. This corresponds to the inter- 
polation problem, in which the controller should use exactly the same specified 
feedback gains at each representative point and use the interpolated results between 
representative points. On the other hand, if we have enough computing power to 
deal with a large number of representative points, we can always use ANFIS to
fit desired control actions to a gain-scheduling fuzzy controller; this corresponds to 
the approximation problem. Figure 18.9 shows the situation when we have two 
scheduling variables. 

Examples of applying this method to both one-pole and two-pole inverted pen- 
dulum systems with varying pole lengths are explained in the following section. 
Figure 18.9. Applying gain scheduling to the design of a fuzzy controller: (a)
interpolation when the number of operating points is small; (b) approximation when
the number of operating points is large.

Figure 18.10. MFs for scheduling variable in the CP system with a varying pole
length.


18.4.2 Case Studies 

Cart and Pole System with a Varying Pole Length 

Here the pole length follows a sinusoidal wave that changes between 0.5 and 1.5 
m. We assume that the pole has a constant density, so the changes in pole length 
also imply changes in pole mass. We used three representative points (0.5, 1, and 
1.5) of the scheduling variable (that is, pole length) to construct three fuzzy rules, 
as described in Equation (18.5). The membership functions for the pole length are 
Π functions shown in Figure 18.10. The feedback gains of each rule were obtained
via the linear quadratic optimal design method (with the MATLAB command lqr), 
where the linearized model was derived (with the MATLAB command linmod) at 
the origin and the representative point for the state and the scheduling variables, 
respectively. 

Figure 18.11 is the SIMULINK block diagram of this system. Once the simulation
starts, an animation window [Figure 18.12(a)] displays the motion of the cart
and the pole; the triangle is the desired cart position and the arrow indicates the
direction and the magnitude of the applied force. If we take a snapshot at each
time step, it can clearly be seen that the pole length follows a sinusoidal wave, as
shown in Figure 18.12(b).

Figure 18.11. Block diagram for the CP system with a varying pole length.

Figure 18.12. Animation for CP system: (a) animated graphic representation; (b)
time-lapse plot.

We found that this system is extremely robust; the controller can balance the
pole as well as move the cart to a desired position even when the pole length is
changed randomly within the interval [0.5, 1.5] m. However, this is not the case for
the next example.

Figure 18.13. MFs for scheduling variable in the CPP system with a varying pole
length.

Cart and Parallel Poles System with a Varying Pole Length 

The CPP system has two poles (A and B) on the same cart; the control task is 
to balance both poles and at the same time move the cart to a desired position. 
Note that the length of pole B is fixed at 1 m, but that of pole A is time varying 
between 0.5 and 1.5 m. This makes the control task difficult since the system is not 
controllable when both poles have the same length and the same mass. (Consider 
the situation when A and B are exactly the same. Then due to symmetry, it is 
impossible to drive both poles to zero angles if the initial conditions are, say, -10
degrees for A and +10 degrees for B.) Moreover, it is obvious that the control
strategy when pole A is longer than pole B is very different from the one when pole 
A is shorter than pole B. 

Our design approach was almost the same as in the CP system discussed previ- 
ously, except that we had to use more rules to deal with the extreme sensitivity of 
this system. Here we used 11 rules; the 11 MFs for the scheduling variable (length
of pole A) are shown in Figure 18.13. Again, we used linmod to find the linearized 
model at each representative point and lqr to obtain feedback gains for each rule. 

Figure 18.14 is the SIMULINK block diagram of this system. Figure 18.15(a) 
displays the initial conditions of the system; 18.15(b) is a time-lapse plot that shows 
how the length of pole A changes with time. 

Unlike the CP system described earlier, the CPP system is extremely sensitive 
and it could easily become unstable if the fuzzy rules are too few or the length of 
pole A is too close to that of pole B for too long a period of time.

18.5 FEEDBACK LINEARIZATION AND SLIDING CONTROL 

The equations of motion of a class of dynamic systems in a continuous-time domain 
can be expressed in the canonical form: 

$$x^{(n)}(t) = f(x(t), \dot{x}(t), \ldots, x^{(n-1)}(t)) + b\,u(t), \qquad (18.6)$$
Figure 18.14. Block diagram for the CPP system with a varying pole length.

Figure 18.15. Animation for CPP system: (a) animated graphic representation;
(b) time-lapse plot.


where $f$ is an unknown continuous function, $b$ is the control gain, and $u \in R$ and
$y \in R$ are the system’s input and output, respectively. The control objective is to
force the state vector $\mathbf{x} = [x, \dot{x}, \ldots, x^{(n-1)}]^T$ to follow a specified desired trajectory
$\mathbf{x}_d = [x_d, \dot{x}_d, \ldots, x_d^{(n-1)}]^T$. If we define the tracking error vector as $\mathbf{e} = \mathbf{x} - \mathbf{x}_d$, then
the control objective is to design a control law $u(t)$ which ensures $\mathbf{e} \to 0$ as $t \to \infty$.
(For simplicity, we assume $b = 1$ in the following discussion.)

Equation (18.6) is a typical feedback linearizable system since it can be
reduced to a linear system if $f$ is known exactly; specifically, the control law

$$u(t) = -f(\mathbf{x}(t)) + x_d^{(n)} - \mathbf{k}^T \mathbf{e} \qquad (18.7)$$

would transform the original nonlinear dynamical equation into a linear one
(substituting into Equation (18.6) with $b = 1$ gives $e^{(n)} = -\mathbf{k}^T \mathbf{e}$):

$$e^{(n)}(t) + k_1 e^{(n-1)}(t) + \cdots + k_n e(t) = 0, \qquad (18.8)$$

where $\mathbf{k} = [k_n, \ldots, k_1]^T$ is an appropriately chosen vector that ensures satisfactory
behavior of the closed-loop linear system in Equation (18.8).

Since $f$ is unknown, an intuitive candidate for $u$ would be

$$u = -F(\mathbf{x}, \mathbf{p}) + x_d^{(n)} - \mathbf{k}^T \mathbf{e} + v, \qquad (18.9)$$

where $v$ is an additional control input to be determined later, and $F(\cdot)$ is a
parameterized function (such as an ANFIS, neural network, or any other type of adaptive
network) that has enough degrees of freedom to approximate $f(\cdot)$. Using this control
law, the closed-loop system becomes

$$e^{(n)} + k_1 e^{(n-1)} + \cdots + k_n e = (f - F) + v. \qquad (18.10)$$

Now the problem is divided into two tasks: 

• How to update the parameter vector $\mathbf{p}$ incrementally so that $F(\mathbf{x}, \mathbf{p}) \approx f(\mathbf{x})$
for all $\mathbf{x}$.

• How to apply $v$ to guarantee global stability while $F$ is approximating $f$ during
the whole process.

The first task is not too difficult as long as $F$ is equipped with enough parameters
to approximate $f$. For the second task, we need to apply the concept of a branch
of nonlinear control theory called sliding mode control [27, 32]. The standard
approach is to define an error metric as

$$s(t) = \left(\frac{d}{dt} + \lambda\right)^{n-1} e(t), \quad \text{with } \lambda > 0. \qquad (18.11)$$


The equation $s(t) = 0$ defines a time-varying hyperplane in $R^n$ on which the tracking
error vector $\mathbf{e}(t) = [e(t), \dot{e}(t), \ldots, e^{(n-1)}(t)]^T$ decays exponentially to zero, so that
perfect tracking can be obtained asymptotically. Moreover, if we can maintain the
following condition:

$$\frac{d|s(t)|}{dt} \le -\eta, \qquad (18.12)$$

then $|s(t)|$ will reach zero in a finite time less than or equal
to $|s(0)|/\eta$. In other words, by maintaining the condition in Equation (18.12), $s(t)$
will approach the sliding surface $s(t) = 0$ in a finite time, and then the error vector
$\mathbf{e}(t)$ will converge to the origin exponentially with the time constant $(n - 1)/\lambda$.
From Equation (18.11), $s$ can be rearranged as follows:

$$s = \left(\lambda + \frac{d}{dt}\right)^{n-1} e = [\lambda^{n-1}, (n-1)\lambda^{n-2}, \ldots, 1]\,\mathbf{e}. \qquad (18.13)$$

Differentiating the preceding equation and plugging in from Equation (18.10),
we obtain

$$
\begin{aligned}
\frac{ds}{dt} &= e^{(n)} + [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, \ldots, \lambda]\,\mathbf{e} \\
&= f - F + v - [k_n, k_{n-1}, \ldots, k_1]\,\mathbf{e} + [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, \ldots, \lambda]\,\mathbf{e}.
\end{aligned} \qquad (18.14)
$$

By setting $[k_n, k_{n-1}, \ldots, k_1] = [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, \ldots, \lambda]$, we have

$$\frac{ds}{dt} = f - F + v,$$

and

$$\frac{d|s|}{dt} = (f - F + v)\,\mathrm{sgn}(s).$$

That is, Equation (18.12) is satisfied if and only if

$$(f - F + v)\,\mathrm{sgn}(s) \le -\eta.$$

If we assume that the approximation error $|f - F|$ is bounded by a positive number
$A$, then the preceding inequality is always satisfied if

$$v = -(A + \eta)\,\mathrm{sgn}(s).$$

In summary, if we choose the control law as 

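$$u(t) = -F(\mathbf{x}, \mathbf{p}) + x_d^{(n)} - [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, \ldots, \lambda]\,\mathbf{e} - (A + \eta)\,\mathrm{sgn}(s),$$

where $F(\mathbf{x}, \mathbf{p})$ is an adaptive network that approximates $f(\mathbf{x})$ and $A$ is the error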
bound, then the closed-loop system can achieve perfect tracking asymptotically with 
global stability. 
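The behavior of such a control law can be checked with a small simulation for the second-order case ($n = 2$, $b = 1$), where $s = \dot{e} + \lambda e$. Everything specific below is an assumption of ours: the "unknown" plant function $f$, the deliberately crude approximator $F = 0$, the bound, and the gains. With $e = x - x_d$, the feedback term enters as $-\lambda \dot{e}$, which gives $\dot{s} = f - F + v$ as in the derivation above.

```python
import math

def sgn(value):
    return (value > 0) - (value < 0)

def simulate(t_end=10.0, dt=0.001, lam=2.0, eta=0.5, bound=1.2):
    """Euler simulation of x'' = f(x, x') + u under the sliding control law.
    Returns the final tracking errors (e, e_dot)."""
    def f(x, x_dot):                  # hypothetical plant term, |f| <= 1.1
        return 0.8 * math.sin(x) + 0.3 * math.cos(3.0 * x_dot)

    def F(x, x_dot):                  # crude approximator, so |f - F| <= 1.1 < bound
        return 0.0

    x, x_dot, t = 1.0, 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        xd, xd_dot, xd_ddot = math.sin(t), math.cos(t), -math.sin(t)
        e, e_dot = x - xd, x_dot - xd_dot
        s = e_dot + lam * e                       # error metric for n = 2
        v = -(bound + eta) * sgn(s)               # switching term
        u = -F(x, x_dot) + xd_ddot - lam * e_dot + v
        x_ddot = f(x, x_dot) + u
        x += x_dot * dt
        x_dot += x_ddot * dt
        t += dt
    return x - math.sin(t), x_dot - math.cos(t)

final_e, final_e_dot = simulate()
print(final_e, final_e_dot)
```

Even though $F$ ignores the plant entirely, the switching term dominates the approximation error and the tracking error shrinks to the level of the discretization chatter. A better approximator would allow a smaller bound and hence a gentler switching action.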

This approach uses a number of nonlinear control design techniques and pos- 
sesses rigorous proofs for global stability. However, its applicability is restricted to 
feedback linearizable systems. The reader is referred to ref. [27] for a more detailed 
treatment of this subject. Applications of this technique to neural and fuzzy control 
can be found in refs. [26] and [33], respectively. 

18.6 SUMMARY 

This chapter presents four design techniques for neuro-fuzzy controllers: reinforce- 
ment learning, derivative-free optimization (GA in particular), gain-scheduling ap- 
proaches, and feedback linearization in conjunction with sliding control. Note that 
we are not attempting to present an exhaustive review here. Some other design and
analysis approaches do exist, though they are less common than those described 
in this chapter. Some of the methods not described here include cell-to-cell map- 
ping techniques [6, 28], the model-based design method [30], and self-organizing 
controllers [25, 31]. 


EXERCISES 

1. The crossover operator in Example 18.1 preserves the total number of MFs. 
Explain why briefly. 

2. Redefine the crossover operator in Example 18.1 such that the child FISs have 
the same numbers of MFs as their parents. 

3. Redefine the mutation operator in Example 18.1 such that the number of rules 
is preserved. 

4. In Example 18.2, we always choose the longest cut when there is more than 
one way to execute a cut. Redraw Figure 18.7 for a policy that chooses the 
shortest cut. 

5. Devise a coding scheme for a multilayer perceptron and define its crossover and 
mutation operators. 
19.1 INTRODUCTION

This chapter describes several applications of ANFIS to a variety of domains. Some 
of these representative applications employ real-world data, some use synthetic data; 
all of them shed light on how to tackle different tasks with similar natures. 
Applications covered in this chapter fall into several categories: 

• Pattern recognition: printed character recognition

• Robotics: inverse kinematics 

• Nonlinear regression: automobile miles per gallon (MPG) prediction 

• Nonlinear system identification: furnace modeling 

• Adaptive signal processing: channel equalization and noise cancellation. 

ANFIS applications to automatic control are not described here; this is covered
extensively in Chapters 17 and 18. 

19.2 PRINTED CHARACTER RECOGNITION 

In this section, we describe a straightforward design method for a fuzzy inference 
system to solve pattern recognition problems; this is based loosely on the concept of 
nearest-neighbor classification or case-based reasoning. Iterated training is 
not mandatory for this design method, but the method does require some represen- 
tative, noise-free data points from the recognition system to be modeled. We revisit 
the Exclusive-OR (XOR) problem to demonstrate the concept behind this design 
method, and then we apply the method to printed character recognition problems.

As explained in Section 9.2.2, to solve a binary XOR problem, we need to classify 
a binary input vector to class 0 if the vector has an even number of 1s; otherwise,
it is assigned to class 1. The desired behavior of the two-input XOR problem is 
described by the following truth table: 


                      X    Y    Class
Desired i/o pair 1    0    0    0
Desired i/o pair 2    0    1    1
Desired i/o pair 3    1    0    1
Desired i/o pair 4    1    1    0

Figure 19.1. (a) Training data for XOR problem (plot legend: o: class 1, x: class 2)
(MATLAB file: xordata.m); (b) Z and S membership functions for “near 0” and
“near 1,” respectively. (MATLAB file: xormf.m)



From the training data plot in Figure 19.1(a), it is obvious that the XOR problem 
is not linearly separable and cannot be solved by a single-layer perceptron. To use 
an MLP (multilayer perceptron) with a hidden layer to solve it, we need to train 
the network. 

By noting that these training data are representative and noise free, we can 
use them as prototypes for the fuzzy logic design approach based on nearest- 
neighbor classification or case-based reasoning (see also the interpolation RBFN in 
Section 9.5.3). For a given set of prototypes, the underlying rationale for classifying 
a new data point is simple: Find the prototype nearest to the new data point and 
assign the point to that prototype class. To do this, we need a similarity measure 
that quantifies the meaning of near. This is done in terms of membership functions 
(MFs). In Figure 19.1(b), for instance, the meaning of “near 0” and “near 1” can 
be expressed as the Z and S MFs, respectively. 

We still need to know the meaning of closeness between the input data [x, y]
and one of the prototypes, say, [0, 1]. If we take “[x, y] is near [0, 1]” to mean
that “x is near 0 AND y is near 1,” then all we need to do is assign an
appropriate operator to AND. The most popular fuzzy AND operators are “product”
and “min.” Figure 19.2 demonstrates the use of “product” and “min” in generating
two-dimensional MFs for “[x, y] is near [0, 1].” The use of “product” generates
concentric contours, while the use of “min” generates square contours. See Figure 19.2
for the composite two-dimensional MFs and their contours.

Figure 19.2. Two-dimensional MFs for “[x, y] is near [0, 1],” with the use
of “product” (left column) and “min” (right column) for the fuzzy AND operator.
(MATLAB file: xor2dmf.m)


Creating a fuzzy rule set for solving the XOR problem is now obvious: 

Rule 1: IF x is near 0 AND y is near 0 THEN output = 0. 

Rule 2: IF x is near 0 AND y is near 1 THEN output = 1. 

Rule 3: IF x is near 1 AND y is near 0 THEN output = 1. 

Rule 4: IF x is near 1 AND y is near 1 THEN output = 0. 

In other words, if input data [x, y] is close to one of the prototypes, it is then 
assigned to that prototype class. We can display the input-output behavior of 
the constructed fuzzy inference system as a two-dimensional surface, as shown in 
Figure 19.3. 
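The four rules can be turned into a working zero-order Sugeno system in a few lines. In this sketch, the MFs for “near 0” and “near 1” are simple linear ramps on [0, 1], used here as stand-ins for the Z and S MFs of Figure 19.1(b), and the function names are ours. Either “product” or “min” can be passed in as the AND operator.

```python
def near0(v):
    """Linear ramp standing in for the Z MF: 1 at v = 0, 0 at v = 1."""
    return min(1.0, max(0.0, 1.0 - v))

def near1(v):
    """Linear ramp standing in for the S MF: 0 at v = 0, 1 at v = 1."""
    return min(1.0, max(0.0, v))

# (MF for x, MF for y, rule output), one entry per prototype.
RULES = [
    (near0, near0, 0.0),
    (near0, near1, 1.0),
    (near1, near0, 1.0),
    (near1, near1, 0.0),
]

def product(a, b):
    return a * b

def xor_fis(x, y, fuzzy_and=product):
    """Weighted average of the rule outputs, weighted by firing strength."""
    strengths = [fuzzy_and(mf_x(x), mf_y(y)) for mf_x, mf_y, _ in RULES]
    weighted = sum(w * out for w, (_, _, out) in zip(strengths, RULES))
    return weighted / sum(strengths)

for point in [(0, 0), (0, 1), (1, 0), (1, 1), (0.2, 0.9)]:
    print(point, xor_fis(*point), xor_fis(*point, fuzzy_and=min))
```

At the four prototypes exactly one rule fires at full strength, so the system reproduces the truth table without any training; in between, the output varies smoothly, which is the surface the text refers to.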

Now we can move on to a more challenging problem: printed character recogni- 
tion (PCR), in which each of 26 letters is defined as a 7 x 5 pixel matrix, as shown 
in Figure 19.4. The challenge is to build a fuzzy inference system that can classify 
a given set of 35 (= 7 x 5) pixels to one of the 26 alphabet characters. Again, these 
26 prototypes are noise free, and we can employ the concept referred to previously 
in designing a fuzzy inference system. 
Figure 19.4. Twenty-six printed alphabet characters. (MATLAB file: pchar.m)


• Construct MFs for each of the 35 inputs. Note that in the prototypes, each 
pixel is either 0 or 1, so we can set up MFs for “near 0” and “near 1” in the 
same way as we did in Figure 19.1(b). 

• Set up rules. Each prototype represents a rule, so we have 26 rules, each of 
them an AND rule with 35 preconditions. Each rule’s output is not critical, 
and we can set it to be an arbitrary constant (in a Sugeno fuzzy model) or 
MF (in a Mamdani fuzzy model).

• Use the fuzzy inference system. Note that the output is a categorical variable 
for which no numerical order is assumed, so it would be wrong to interpret 
the final numerical output of the fuzzy inference system directly. (The final 
output is wrong anyway since we set the outputs of all rules to be arbitrary
constants or MFs.) Instead, distance measure information is embedded in each
rule’s firing strength: the larger the firing strength of a rule is, the closer a
given input is to the prototype of that rule. Therefore, we obtain 26 firing
strengths; the letter corresponding to the maximal firing strength is then
selected as the predicted class. 
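The procedure above amounts to a maximum-firing-strength classifier. The sketch below shows the idea on three tiny 3 x 3 patterns that we made up for brevity (the actual system uses 26 prototypes of 7 x 5 pixels); the linear “near 0” and “near 1” MFs and the product AND are likewise illustrative choices.

```python
def near1(p):
    """Linear stand-in for the 'near 1' MF on pixel intensities in [0, 1]."""
    return min(1.0, max(0.0, p))

def near0(p):
    """Linear stand-in for the 'near 0' MF."""
    return 1.0 - near1(p)

# Hypothetical noise-free prototypes, row by row (1 = ink, 0 = background).
PROTOTYPES = {
    "T": [1, 1, 1,  0, 1, 0,  0, 1, 0],
    "L": [1, 0, 0,  1, 0, 0,  1, 1, 1],
    "X": [1, 0, 1,  0, 1, 0,  1, 0, 1],
}

def firing_strength(pixels, prototype):
    """AND (product) over one precondition per pixel."""
    strength = 1.0
    for p, bit in zip(pixels, prototype):
        strength *= near1(p) if bit == 1 else near0(p)
    return strength

def classify(pixels):
    """Pick the prototype whose rule fires most strongly."""
    return max(PROTOTYPES, key=lambda name: firing_strength(pixels, PROTOTYPES[name]))

noisy_t = [0.9, 0.8, 1.0,  0.1, 0.9, 0.2,  0.1, 0.7, 0.1]
print(classify(noisy_t))
```

Note that the rule outputs never appear: only the firing strengths are compared, which is why the consequent of each rule can be left arbitrary.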

To test the fuzzy inference system, we can assign various noise levels to the 
input pattern, as shown in Figure 19.5. The fuzzy PCR system thus obtained 
performs comparably to a similar system using an MLP (multilayer perceptron). 
Other factors that make fuzzy PCR a better choice are as follows: 

• It does not require any training.

• It is a knowledge representation, and each rule in the system represents our 
insight into the problem we want to solve. 

In this approach, we did not use any optimization schemes. We may invoke
derivative-based (Chapter 6) or derivative-free (Chapter 7) optimization techniques
if the described approach fails to correctly classify noisy characters that humans
can still recognize. Since the described method already gives us a roughly correct fuzzy
inference system, the training time required to fine-tune membership functions is 
likely to be much shorter than that for an MLP starting with random weights. 

Note that training a fuzzy inference system for pattern recognition is not exactly
the same as the ANFIS training described in Chapter 12. Taking the fuzzy inference
system used here as an example, we are only interested in the firing strengths, not
the final outputs after weighted average or defuzzification. Therefore, the error
measure should be a function of the discrepancy between desired and actual firing 
strengths; there is no need to calculate the final output of the fuzzy inference system. 
This corresponds to tuning premise parameters only, and it is faster than the full- 
scale ANFIS training introduced in Chapter 12. 

19.3 INVERSE KINEMATICS PROBLEMS 

In this section, we use ANFIS to model the inverse kinematics of the two-joint 
planar robot arm shown in Figure 19.6. This problem involves learning to map 
from an endpoint Cartesian position $(x, y)$ to joint angles $(\theta_1, \theta_2)$, and it requires
that the end effector (“hand”) be able to follow the reference signal without being
given the joint angles. The forward kinematics equations from $(\theta_1, \theta_2)$ to $(x, y)$ are
straightforward:

$$
\begin{cases}
x = l_1 \cos(\theta_1) + l_2 \cos(\theta_1 + \theta_2), \\
y = l_1 \sin(\theta_1) + l_2 \sin(\theta_1 + \theta_2),
\end{cases}
$$

where $l_1$ and $l_2$ are arm lengths; and $\theta_1$ and $\theta_2$ are their respective angles (see
Figure 19.6). However, the inverse mappings from $(x, y)$ to $(\theta_1, \theta_2)$ are not as clear.
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End Effector 



Figure 19.6. Two- joint planar robot arm. 

In this case, it is possible to find the inverse mappings algebraically, but the solutions 
are not generally available for a multiple-joint robot arm in 3-D space. Instead of 
solving the equations directly, we use two ANFIS systems to learn these inverse 
mappings. Figure 19.7 demonstrates the forward mappings from $(\theta_1, \theta_2)$ to $(x, y)$ 
(the first row) and the inverse mappings from $(x, y)$ to $(\theta_1, \theta_2)$ (the second row). 
Here we assume that $l_1 = 10$, $l_2 = 7$, and the values of $\theta_2$ are restricted to $[0, \pi]$. 
Note that when $\sqrt{x^2 + y^2}$ is greater than $l_1 + l_2$ or less than $|l_1 - l_2|$, there is no 
corresponding $(\theta_1, \theta_2)$. This is called the unreachable workspace, and it can be 
seen clearly in the plots in the second row of Figure 19.7. 

Figure 19.7. Direct (the first row) and inverse (the second row) kinematics of a 
two-joint planar robot arm. (MATLAB file: invsurf.m) 
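
For the two-joint arm, the algebraic inverse mentioned above can be written down directly, which is useful for generating training pairs or for checking a learned mapping. The following Python sketch is an illustration only, not the book's MATLAB code; the function names `forward` and `inverse` are our own, and the closed form is the standard law-of-cosines solution with $\theta_2$ restricted to $[0, \pi]$.

```python
import math

L1, L2 = 10.0, 7.0  # arm lengths assumed in the text


def forward(theta1, theta2, l1=L1, l2=L2):
    """Forward kinematics: joint angles -> end-effector position (x, y)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def inverse(x, y, l1=L1, l2=L2):
    """Closed-form inverse with theta2 in [0, pi].

    Returns None when (x, y) lies in the unreachable workspace.
    """
    r = math.hypot(x, y)
    if r > l1 + l2 or r < abs(l1 - l2):
        return None
    # Law of cosines gives cos(theta2); clamp to guard against rounding.
    c2 = (r * r - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    c2 = max(-1.0, min(1.0, c2))
    theta2 = math.acos(c2)
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2


# Round trip: angles -> position -> angles
t1, t2 = inverse(*forward(0.3, 1.0))
```

Points farther than $l_1 + l_2$ from the origin, or closer than $|l_1 - l_2|$, return `None`, which corresponds to the unreachable workspace.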


From the first quadrant, we collected 229 training data pairs of the form 
$(x, y, \theta_1)$ and $(x, y, \theta_2)$, respectively, to train two ANFIS systems. (This corresponds 
to the MANFIS architecture described in Section 13.2.) We used three MFs for each 
input; thus there were nine rules and 45 parameters for each ANFIS. We trained 
both ANFIS systems for 50 epochs. Figure 19.8 shows the test results when 
an ellipse was chosen as the reference path. The dashed line shows how the end 
effector follows the path based on the inverse mappings learned by the two ANFIS 
systems; the crosses indicate the locations of the training data. Note that as long as 
the ellipse was inside the region covered by the training data, both ANFIS systems 
performed almost perfectly in following the desired trajectory. However, when some 
part of the ellipse was outside the region covered by the training data, the robot arm 
behaved unpredictably when the desired trajectory reached the “untrained” parts. 
This shows that ANFIS is very good at interpolation when data are abundant, but 
not so good at extrapolation when data are scarce. This phenomenon is common 
to all regression models, and it is more important to understand your data 
than to select a fancy model. 

The approach described here is similar to the inverse control introduced in Chap- 
ter 17; we can apply other on-line approaches to force the end effector to follow the 
trajectory more closely over time. 

Figure 19.8. Trajectory following of a two-joint robot arm, where the crosses 
indicate the training data locations, the solid line is the desired trajectory, and the 
dashed line is the actual trajectory. You can move the ellipse by clicking inside it 
and dragging it around. (MATLAB file: invkine.m) 

19.4 AUTOMOBILE MPG PREDICTION 


This section describes the use of ANFIS for nonlinear regression. In particular, 
we address the issue of input selection for finding important input variables and 
reducing training data dimensions. We shall use automobile MPG (miles per gallon) 
prediction as a case study, in which an automobile’s fuel consumption in terms of 
MPG is predicted by ANFIS based on several given characteristics, such as number 
of cylinders, weight, model year, and so on. 

The automobile MPG prediction problem is a typical nonlinear regression problem, 
in which several attributes (input variables) are used to predict another continuous 
attribute (output variable). In this case, the six input attributes include 
profile information about the automobiles: 


No. of cylinders:   multi-valued discrete 
Displacement:       continuous 
Horsepower:         continuous 
Weight:             continuous 
Acceleration:       continuous 
Model year:         multi-valued discrete 


The attribute to be predicted in terms of the preceding six input attributes is 
the fuel consumption in MPG. Table 19.1 provides a list of seven instances selected 
at random from the data set. 








Table 19.1. Samples of the MPG training data set. (The last column is used for 
reference only and not for prediction.) 

Cyl.  Disp.  HP   Weight  Accel.  Year  MPG   Car name 
8     307    130  3504    12      70    18    Chevrolet Chevelle Malibu 
?     198    95   2833    15.5    70    22    Plymouth Duster 
4     90     ?    ?       ?       ?     24    Fiat 128 
8     ?      ?    ?       ?       ?     17    Oldsmobile Cutlass Supreme 
4     89     ?    ?       ?       ?     37.7  Toyota Tercel 
4     107    75   2205    14.5    82    36    Honda Accord 
4     120    79   2625    18.6    82    28    Ford Ranger 


The data set is available from the UCI (University of California at Irvine) Repository 
of Machine Learning Databases and Domain Theories.¹ More historical information 
about the data set can be found there. After removing instances with 
missing values, the data set was reduced to 392 entries. Our task was then to 
use this data set and ANFIS to construct a fuzzy inference system that could best 
predict the MPG of an automobile given its six profile attributes. 

To apply ANFIS to MPG prediction, we needed to take care of two problems 
first: the scarcity of data points and the style of input space partition. 

• Data scarcity: For a single-input data-fitting problem of medium complexity, 
we usually need 10 data points to come up with a good model. Similarly, 
for a two-input data-fitting problem, we need $10^2 = 100$ data points to get 
approximately the same performance. Therefore, for a six-input problem, 
such as the MPG prediction, ideally we should have $10^6 = 1{,}000{,}000$ data 
points. However, this is prohibitively large for any common modeling problem. 
Considering that we have only 392 data instances (which corresponds 
to $\sqrt[6]{392} \approx 2.7$ data points for single-input data fitting), the use of these 
data becomes an important issue. This data scarcity dilemma is ubiquitous 
in multivariate regression. A commonly used solution, already explained in 
Chapter 12, is to divide the data set into training and test data sets; the 
training set is used for model building, while the test set is used for model 
validation. Thus, the resultant model is not biased toward the training data 
set and it is likely to have a better generalization capacity to new data. 

• Input space partitioning: Grid partitioning is the most frequently used 
input partitioning method. However, for a problem with six inputs, grid 
partitioning leads to at least $2^6 = 64$ rules, which results in $(6 + 1) \times 64 = 448$ 
linear parameters if we want to stick to the first-order Sugeno fuzzy model. 
This implies that we have too many fitting parameters, and the resultant 


¹ FTP address: ftp://ics.uci.edu/pub/machine-learning-databases/auto-mpg 





Figure 19.9. Fifteen two-input fuzzy models for automobile MPG prediction; 
training errors are shown as a solid line and test errors as a dashed line. 
(MATLAB file: mpgpick2) 

Figure 19.10. Error curves obtained by training a fuzzy inference system to predict 
MPG; the training error is shown as a solid line and the test error as a dashed line. 
(MATLAB file: mpgtrain.m) 


model is not reliable for unforeseen inputs. To deal with this, we can either 
select certain inputs that have more prediction power instead of using all the 
inputs, or choose tree or scatter partitioning using the structure identification 
techniques described in Chapters 14 and 15, respectively. Here we consider 
only input dimension reduction. 
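
The arithmetic behind the two problems above is easy to reproduce. The following Python sketch is an illustration under stated assumptions: the helper name `grid_anfis_size` is our own, and the default of three parameters per membership function assumes generalized bell MFs, which the text does not specify at this point.

```python
def grid_anfis_size(n_inputs, mfs_per_input, params_per_mf=3):
    """Rule and parameter counts for a grid-partitioned first-order Sugeno model.

    params_per_mf=3 is an assumption (e.g., generalized bell MFs).
    """
    rules = mfs_per_input ** n_inputs
    premise = n_inputs * mfs_per_input * params_per_mf
    consequent = (n_inputs + 1) * rules  # one linear function per rule
    return {"rules": rules, "premise": premise, "consequent": consequent}


# Six inputs with two MFs each: the smallest grid partition for the MPG problem
six_input = grid_anfis_size(6, 2)

# Two inputs with two MFs each: the size of each candidate model after input selection
two_input = grid_anfis_size(2, 2)

# Equivalent number of data points per input dimension for 392 instances
points_per_dim = 392 ** (1.0 / 6.0)
```

With only two inputs, the same grid partition needs four rules and 12 linear parameters, which is why input dimension reduction is attractive when data are scarce.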

Before training a fuzzy inference system, we divide the data set into training and 
test sets. The training set is used to train (or tune) a fuzzy model, while the test 
set is used to determine when training should be terminated to prevent overfitting. 
The 392 instances are randomly divided into training and test sets of equal size 
(196). 

If we only want to select the two most relevant inputs as predictors, we can 
cycle through all the inputs and build $\binom{6}{2} = 15$ fuzzy models trained by the anfis 
command in the Fuzzy Logic Toolbox. The anfis command utilizes iterative 
optimization techniques to fine-tune parameters, and the training process could be 
lengthy. Fortunately, an efficient least-squares method is employed in the inner 
loop of anfis, and the performance after the first epoch is usually a good index of 
how well the fuzzy model will perform after further training. Based on this heuristic 
observation, we built 15 fuzzy models, each with a single epoch of ANFIS training; 
the results are shown in Figure 19.9, with two curves representing training and 
test RMSE (root-mean-squared errors). We reordered these 15 models according to 
their training errors. Obviously, the best model takes “weight” and “model year” as 
the input variables, which is reasonable. In this case, both error curves are more or 
less consistent; this implies that the training and test data were evenly distributed 
across the original data set. In particular, we will end up with the same model if 
we pick the one with the smallest test error. Note that Figure 19.9 is based on only 
one epoch of training; more reliable results can be obtained if more training epochs 
are allotted to each of the 15 models. 

Figure 19.11. Membership functions in chaotic time series prediction: (a) ANFIS 
surface for MPG prediction (MATLAB file: mpgtrain.m); (b) training and checking 
data distribution. (MATLAB file: mpgdata.m) 
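
The exhaustive search over input pairs can be sketched in a few lines. The Python example below is not the anfis command: as a stand-in for one epoch of ANFIS training it fits a plain linear model by least squares, and it runs on synthetic data in which only inputs 3 and 5 matter. All names (`solve`, `fit_rmse`, `ranking`) are our own.

```python
import itertools
import math
import random


def solve(a, b):
    """Solve the square system a * w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


def fit_rmse(rows, targets, cols):
    """One-pass least-squares fit of y = w0 + sum(w_j * x_j); returns training RMSE."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    ata = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    aty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    w = solve(ata, aty)
    sq = sum((sum(wi * di for wi, di in zip(w, d)) - t) ** 2
             for d, t in zip(design, targets))
    return math.sqrt(sq / len(targets))


# Synthetic data (not the MPG data): six candidate inputs, output uses inputs 3 and 5
random.seed(0)
rows = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(200)]
targets = [2.0 * r[3] - 3.0 * r[5] + random.gauss(0, 0.05) for r in rows]

# Rank all C(6, 2) = 15 input pairs by training RMSE, smallest first
ranking = sorted(itertools.combinations(range(6), 2),
                 key=lambda cols: fit_rmse(rows, targets, cols))
best_pair = ranking[0]
```

The same loop applies to any quickly trained model; replacing `fit_rmse` with one epoch of ANFIS training gives the procedure described above.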

Once we have selected the model with “weight” and “model year” as inputs, 
we can refine its performance via extended training by the anfis command. 
Figure 19.10 shows the error curves for 100 epochs of training. The training error 
decreases all the way, but the test error, after decreasing initially, reaches a plateau, 
oscillates a bit, and then increases. Usually we use the test error as a true measure 
of the model’s performance; therefore, the best model we can achieve occurs when 
the test error is minimal. This corresponds to the circle in Figure 19.10; although 
further training beyond this point decreases the training error, it will degrade the 
performance of the fuzzy inference system on unforeseen inputs. 
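
Selecting the model at the minimum of the test error curve amounts to a simple search over the recorded errors. The following Python sketch uses made-up error values shaped like the curves described above; they are not the values plotted in Figure 19.10, and the helper name `best_epoch` is our own.

```python
def best_epoch(test_errors):
    """Return (1-based epoch, error) at the minimum of the test-error curve."""
    idx = min(range(len(test_errors)), key=test_errors.__getitem__)
    return idx + 1, test_errors[idx]


# Hypothetical curves: training error keeps falling, test error turns upward
train_err = [3.20, 2.90, 2.75, 2.66, 2.60, 2.57, 2.55]
test_err = [3.40, 3.10, 3.00, 2.95, 2.97, 3.02, 3.10]

epoch, err = best_epoch(test_err)
```

In practice the parameters saved at that epoch are the ones to keep, even though later epochs show a smaller training error.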
As a comparison, we now look at the result of linear regression, where the model 
is expressed as 

$$
\text{MPG} = a_0 + a_1 \cdot \text{cyl} + a_2 \cdot \text{disp} + a_3 \cdot \text{hp} + a_4 \cdot \text{weight} + a_5 \cdot \text{accel} + a_6 \cdot \text{year},
$$

with $a_0, a_1, \ldots, a_6$ being seven modifiable linear parameters. The optimum values 
of these linear parameters were obtained directly by the least-squares method described 
in Chapter 5; the training and test errors are 3.45 and 3.44, respectively. In 
contrast, after 100 epochs of ANFIS training, the minimal test error is 2.98, at which the 
training error is 2.61. It is worth noting that the linear model takes all six inputs 
into consideration, but the error measures are still high since MPG prediction is 
nonlinear. On the other hand, our input selection technique of choosing the two 
most relevant inputs can result in a nonlinear mapping with lower error measures. 

Figure 19.11(a) is a three-dimensional surface of the fuzzy model with the small- 
est test error. This is a smooth nonlinear surface, but it raises a legitimate question: 
Why does the surface increase toward the right upper corner? This is an appar- 
ently spurious result that states that heavy old cars have higher MPG ratings. 
The anomaly can be explained by the scatter plot of the data distribution in Fig- 
ure 19.11(b), in which it is obvious that the lack of data (due to the tendency of 
automobile manufacturers to begin building small compact cars instead of big heavy 
ones during the mid-1970s) is responsible. In other words, our trained fuzzy inference 
system is good at interpolation, but not at extrapolation, as explained in the pre- 
vious section. Therefore, it is advisable for us to understand the data and qualify 
the scope of their validity before interpreting ANFIS output. 

19.5 NONLINEAR SYSTEM IDENTIFICATION 

This section applies ANFIS to nonlinear system identification, using the well-known 
Box and Jenkins gas furnace data [1] as the training data set. This is a time-series 
data set for a gas furnace process with gas flow rate $u(t)$ as the furnace input 
and CO$_2$ concentration $y(t)$ as the furnace output. We want to extract a dynamic 
process model to predict $y(t)$ using 10 candidate inputs to ANFIS: $y(t-1)$, $y(t-2)$, 
$y(t-3)$, $y(t-4)$, $u(t-1)$, $u(t-2)$, $u(t-3)$, $u(t-4)$, $u(t-5)$, and $u(t-6)$. The 
original data set contains 296 $[u(t), y(t)]$ data pairs; converting the data so that 
each training data point consists of $[y(t-1), \ldots, y(t-4), u(t-1), \ldots, u(t-6); y(t)]$ 
(the last one is the desired output) reduces the number of effective data points to 
290. We use the first 145 data points as the training set, and the remaining 145 as 
the test set. 
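
The conversion from the raw series to lagged training points can be sketched as follows. This Python example uses placeholder series of the right length rather than the actual gas furnace data, and the function name `make_regressors` is our own.

```python
def make_regressors(u, y, y_lags=(1, 2, 3, 4), u_lags=(1, 2, 3, 4, 5, 6)):
    """Build rows ([y(t-1..4), u(t-1..6)], y(t)) from two equally long series."""
    start = max(max(y_lags), max(u_lags))  # first t for which all lags exist
    rows = []
    for t in range(start, len(y)):
        inputs = [y[t - k] for k in y_lags] + [u[t - k] for k in u_lags]
        rows.append((inputs, y[t]))
    return rows


# Placeholder series with 296 samples each (not the Box-Jenkins measurements)
u = [float(i) for i in range(296)]
y = [100.0 + i for i in range(296)]

rows = make_regressors(u, y)
train, test = rows[:145], rows[145:]  # first half for training, second for testing
```

Because the longest lag is six samples, the first six time steps cannot form a complete input vector, which is why 296 raw pairs yield 290 usable data points.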

Since we have 10 candidate input variables for ANFIS, it is reasonable to do 
input selection first to rate variable priorities and reduce the input dimension. For 
dynamic system modeling, the inputs selected for ANFIS must contain elements 
from both the set of historical furnace outputs $\{y(t-1), y(t-2), y(t-3), y(t-4)\}$ 
and the set of historical furnace inputs $\{u(t-1), u(t-2), u(t-3), u(t-4), u(t-5), 
u(t-6)\}$. For simplicity, we assume that there are two inputs for ANFIS: one is 
from the historical furnace outputs, the other from the historical furnace inputs. 
In other words, we have to build 24 ($= 4 \times 6$) ANFIS models with various input 
combinations, and then choose the one with the smallest training error for further 
parameter-level fine-tuning. We could have chosen the ANFIS with the smallest 
test error, but this would have led to indirect training on test data. Figure 19.12 
shows the performance of these 24 ANFIS models; they are listed according to their 
training errors. Note that each ANFIS has four rules, and the training took only 
one epoch each to identify linear parameters. If computing power is not a problem, 
we could then assign more training epochs to each ANFIS. 

Figure 19.12. Input selection for Box-Jenkins data; training errors are shown as a 
solid line and test errors as a dashed line. (MATLAB file: bjpick2.m) 

In Figure 19.12, we can see that the ANFIS with $y(t-1)$ and $u(t-3)$ as 
inputs has the smallest training error, so it is reasonable to choose this ANFIS for 
further parameter tuning. Figure 19.13 shows the result of training this ANFIS 
for 100 epochs. In particular, Figure 19.13(a) displays the training and test error 
curves; the optimal ANFIS parameters were obtained at the time when the test 
error reached the minimum indicated by a small circle. Figure 19.13(b) shows the 
data distribution; it demonstrates that the training and test data do not cover the 
same region. Better performance can be expected if they cover roughly the same 
region; this can be achieved by using other schemes to divide the original data set. 
(For instance, the training and test sets can be interleaved in the original data set.) 
Figure 19.13(c) displays the desired curve and ANFIS prediction; the performance 
for time index from 1 to 145 is better since this is the domain from which the 
training data were extracted. Figure 19.13(d) is the ANFIS surface; it is cut off at 
the maximum and minimum of the desired output. 

Our input selection criterion for nonlinear system identification is straightfor- 
ward and provides satisfactory results. Other more advanced criteria and model 
validation techniques can generate more accurate results; see ref. [4] for details. 

Figure 19.13. ANFIS for Box-Jenkins data: (a) training and checking error 
curves; (b) training and checking data distribution; (c) desired system response 
and ANFIS prediction; (d) ANFIS surface. (MATLAB file: bjtrain.m) 


19.6 CHANNEL EQUALIZATION 

This section introduces the channel equalization problem, which arises frequently in 
dispersive digital communication channels, and proposes a way of using ANFIS to 
tackle this specific classification problem in signal processing. Since ANFIS employs 
an efficient least-squares method, little training time is required to do the 
task. 

The concept of a digital communication system is simple: A sequence of binary 
signals s(t) is transmitted from one place to another via a communication channel. 
Ideally, the signal x(t) at the receiving end should be exactly the same as s(t), with 
a slight time delay. However, complications emerge in the real world because 

• communication channels are never perfect; cross coupling, interference, and 
attenuation tend to disperse and weaken signals during transmission; 

• noise is everywhere and is easily added to the transmitted signals. 

Figure 19.14 is a schematic diagram of a digital communications system in which 
a random binary sequence $s(t)$ is transmitted through a linear, dispersive channel 
denoted by $H(z)$, and then corrupted by additive noise $e(t)$. The term $s(t)$ is usually 
assumed to be an independent sequence with an equal probability of being $-1$ or 
$1$. The task of the channel equalizer in Figure 19.14 is to estimate the input signals 
using the information contained in the observations $\mathbf{x}(t) = [x(t), \ldots, x(t-m+1)]$, 
where $m$ is known as the order of the equalizer. Often a delay $d$ is introduced into 
the equalizer, so that at time $t$, the equalizer estimates the input signal at $t - d$. 
When $m = 2$ and $d = 0$, the task of the channel equalizer becomes one of estimating 
the input signal $s(t)$ by using the observed signals $\mathbf{x}(t) = [x(t), x(t-1)]$. Without 
loss of generality, we assume that $m = 2$ and $d = 0$ throughout the following 
discussion. 

Figure 19.14. Data transmission system and channel equalizer. 

When there is no noise, let us define 

$$
P_+ = \{\mathbf{x} \in R^2 \mid s(t) = 1\},
$$

$$
P_- = \{\mathbf{x} \in R^2 \mid s(t) = -1\}.
$$

$P_+$ and $P_-$ represent two sets of possible channel output vectors that can be produced 
from sequences of channel inputs containing $s(t) = 1$ and $-1$, respectively. 
It is easy to show that both $P_+$ and $P_-$ are finite in size, since we assume that 
there is no noise. The task of the equalizer is to decide whether an observation $\mathbf{x}(t)$ 
represents a noise-corrupted version of an element in either $P_+$ or $P_-$, and thus to 
determine the input signal $s(t)$. 
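
Because the channel has finite memory, $P_+$ and $P_-$ can be listed by enumerating every combination of the binary symbols that influence $\mathbf{x}(t)$. The Python sketch below does this for an FIR channel given by its tap coefficients; the function name `channel_states` is our own, and the example channel $0.5 + z^{-1}$ is the nonminimum phase channel studied in this section.

```python
import itertools


def channel_states(taps, order=2):
    """Enumerate noise-free vectors [x(t), ..., x(t-order+1)], grouped by s(t)."""
    n_symbols = len(taps) + order - 1  # symbols s(t), s(t-1), ... that matter
    p_plus, p_minus = set(), set()
    # s[0] = s(t), s[1] = s(t-1), and so on
    for s in itertools.product((1, -1), repeat=n_symbols):
        x = tuple(
            sum(a * s[i + j] for j, a in enumerate(taps))  # x(t-i)
            for i in range(order)
        )
        (p_plus if s[0] == 1 else p_minus).add(x)
    return p_plus, p_minus


# Channel H(z) = 0.5 + z^-1 with equalizer order m = 2 and delay d = 0
p_plus, p_minus = channel_states([0.5, 1.0])
```

For this channel each set contains four points, which confirms that both sets are finite when there is no noise.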

One structure often used for this purpose is the linear transversal equalizer, 
which estimates the input signal $s(t)$ by 

$$
\hat{s}(t) = \mathrm{sgn}(b + k_0 x(t) + k_1 x(t-1)),
$$

where $\mathrm{sgn}(\cdot)$ is the signum function defined by 

$$
\mathrm{sgn}(y) =
\begin{cases}
1 & \text{if } y > 0, \\
-1 & \text{if } y < 0.
\end{cases}
$$

A linear transversal equalizer can estimate the input signal $s(t)$ correctly if and 
only if sets $P_+$ and $P_-$ are linearly separable; that is, we can use a single straight 
line to separate $P_+$ from $P_-$. Suppose that the communication channel is modeled 
as a finite impulse response filter with the following transfer function: 

$$
H(z) = a_0 + a_1 z^{-1} + \cdots + a_k z^{-k} = \sum_{i=0}^{k} a_i z^{-i}.
$$

Then for $P_+$ and $P_-$ to be linearly separable, the roots of the polynomial 

$$
a_0 z^k + a_1 z^{k-1} + \cdots + a_k
$$

must lie strictly within the unit circle on the complex plane. Linear channels whose 
transfer functions satisfy this condition are referred to as minimum phase; otherwise, 
they are said to be nonminimum phase. For instance, $H(z) = 1.0 + 0.8z^{-1} + 0.5z^{-2}$ 
is minimum phase since the corresponding roots, $-0.4 \pm 0.58i$, lie strictly 
within the unit circle on the complex plane. We can plot $P_+$ and $P_-$, denoted as 
“o” and “x”, respectively, on a two-dimensional plane, as shown in Figure 19.15(a). 
The plot indicates that $P_+$ and $P_-$ are linearly separable, and one possible decision 
boundary is shown. The nonminimum-phase channel $H(z) = 0.5 + z^{-1}$ shown in 
a similar plot in Figure 19.15(b) illustrates that $P_+$ and $P_-$ are not linearly separable, 
and we cannot expect a linear transversal equalizer to solve the problem 
satisfactorily. 

Figure 19.15. Channel outputs without noise: (a) minimum phase channel 
$H(z) = 1.0 + 0.8z^{-1} + 0.5z^{-2}$; (b) nonminimum phase channel $H(z) = 0.5 + z^{-1}$. 
(MATLAB file: nonoise.m) 
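
The minimum phase condition can be checked numerically by computing the roots of the channel polynomial. The Python sketch below is deliberately limited to first- and second-order channels, where closed-form roots are available; the function names are our own, and a general-purpose polynomial root finder would be needed for longer channels.

```python
import cmath


def fir_zeros(coeffs):
    """Roots of a0*z^k + a1*z^(k-1) + ... + ak for k = 1 or 2 (closed form)."""
    if len(coeffs) == 2:
        a0, a1 = coeffs
        return [complex(-a1 / a0)]
    if len(coeffs) == 3:
        a0, a1, a2 = coeffs
        disc = cmath.sqrt(a1 * a1 - 4.0 * a0 * a2)
        return [(-a1 + disc) / (2.0 * a0), (-a1 - disc) / (2.0 * a0)]
    raise ValueError("this sketch only handles first- and second-order channels")


def is_minimum_phase(coeffs):
    """True when every root lies strictly inside the unit circle."""
    return all(abs(z) < 1.0 for z in fir_zeros(coeffs))


roots_a = fir_zeros([1.0, 0.8, 0.5])   # H(z) = 1.0 + 0.8 z^-1 + 0.5 z^-2
roots_b = fir_zeros([0.5, 1.0])        # H(z) = 0.5 + z^-1
```

The first channel has roots of magnitude $\sqrt{0.5} \approx 0.71$, inside the unit circle; the second has its single root at $z = -2$, outside it.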

If the channel is indeed represented by a finite impulse response 

$$
H(z) = \sum_{i=1}^{n} a_i z^{-i},
$$

and the additive noise $e(t)$ is a Gaussian sequence, then an optimal equalizer with 
a minimum bit error rate can be derived via the concept of the two-state Bayes 
decision rule [5]: 

$$
\hat{s}(t) = \mathrm{sgn}(f_{de}(\mathbf{x}(t))) = \mathrm{sgn}(f_+(\mathbf{x}(t)) - f_-(\mathbf{x}(t))),
$$

where $f_+(\mathbf{x})$ and $f_-(\mathbf{x})$ are the conditional density functions of $\mathbf{x}(t)$ given $s(t) = 1$ 
and $-1$, respectively. Symbolically, 

$$
f_+(\mathbf{x}) = \sum_{\mathbf{x}_+ \in P_+} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \mathbf{x}_+)^T S^{-1} (\mathbf{x} - \mathbf{x}_+)\right),
$$

$$
f_-(\mathbf{x}) = \sum_{\mathbf{x}_- \in P_-} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \mathbf{x}_-)^T S^{-1} (\mathbf{x} - \mathbf{x}_-)\right),
$$

where $S$ is the covariance matrix of the Gaussian noise $e(t)$: 

$$
S = \begin{bmatrix} E[e^2(t)] & E[e(t)e(t-1)] \\ E[e(t-1)e(t)] & E[e^2(t-1)] \end{bmatrix},
$$

and $\mathbf{x}_+$ and $\mathbf{x}_-$ are elements in $P_+$ and $P_-$, respectively. Therefore, the optimal 
decision boundary consists of the set of points 

$$
\{\mathbf{x} \in R^2 \mid f_{de}(\mathbf{x}) = 0\}.
$$

For the nonminimum phase channel characterized by $H(z) = 0.5 + z^{-1}$, the optimum 
decision surface and boundary are shown in Figure 19.16, where the covariance 
matrix is assumed to be 

$$
\begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix}.
$$

It is obvious that if the noise source is of an unknown nature instead of being 
uncorrelated Gaussian noise, the decision boundary will be affected in a way that 
cannot be directly analyzed. Moreover, if the channel characteristics are not linear, 
then the preceding formulation of an optimal decision boundary is not valid. 
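
For the channel $H(z) = 0.5 + z^{-1}$ with uncorrelated noise of variance 0.2, the Bayes decision function reduces to a difference of two sums of isotropic Gaussian bumps. The Python sketch below evaluates it; the channel states are written out by hand (they follow from enumerating $s(t)$, $s(t-1)$, and $s(t-2)$), and the function names are our own.

```python
import math

# Noise-free channel states for H(z) = 0.5 + z^-1 with m = 2, as (x(t), x(t-1))
P_PLUS = [(1.5, 1.5), (1.5, -0.5), (-0.5, 0.5), (-0.5, -1.5)]
P_MINUS = [(0.5, 1.5), (0.5, -0.5), (-1.5, 0.5), (-1.5, -1.5)]
NOISE_VAR = 0.2  # diagonal covariance S = 0.2 * I


def kernel_sum(x, centers, var=NOISE_VAR):
    """Sum of exp(-0.5 * (x - c)^T S^-1 (x - c)) over the centers, with S = var * I."""
    return sum(
        math.exp(-0.5 * ((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2) / var)
        for c in centers
    )


def f_de(x):
    """Decision function: positive where s(t) = 1 is the more likely input."""
    return kernel_sum(x, P_PLUS) - kernel_sum(x, P_MINUS)


def bayes_equalizer(x):
    """Estimate s(t) from the observation x = (x(t), x(t-1))."""
    return 1 if f_de(x) > 0 else -1
```

Evaluating `f_de` on a grid and tracing its zero contour gives the kind of nonlinear boundary that a single straight line cannot reproduce. If the noise were correlated, the scalar variance would have to be replaced by the full inverse covariance matrix.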

We shall use ANFIS to approximate the optimal decision boundary in a channel 
equalization problem in which the channel characteristics are nonminimum phase and 
are described by $H(z) = 0.5 + z^{-1}$. Since ANFIS estimates the decision boundary 
from sample data directly, we do not need any assumptions about the nature of the 
channel characteristics or the noise signal. 

Before training a fuzzy inference system, we need to collect sample data first. 
This is done by feeding a sequence of binary random signals $s(t)$ into the impulse 
response function defined by 

$$
H(z) = 0.5 + z^{-1}.
$$

Figure 19.17 plots the signals involved in this data collection step; there are 500 total 
training data points. Since the order of the equalizer is 2, we can display the data as 
a 2-D scatter plot, as shown in Figure 19.18(a). [Figures 19.18(b) through 19.18(d) 
are explained later in this section.] 
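
The data collection step can be sketched as follows. This Python example is an illustration, not the equdata.m script: the noise variance of 0.2 is an assumption borrowed from the covariance matrix used for the optimal boundary, and the function name and seed are our own.

```python
import random


def make_equalizer_data(n_points=500, taps=(0.5, 1.0), noise_std=0.2 ** 0.5, seed=0):
    """Generate ((x(t), x(t-1)), s(t)) pairs for a second-order, zero-delay equalizer."""
    rng = random.Random(seed)
    n_sym = n_points + len(taps)  # extra symbols to fill the channel and delay line
    s = [rng.choice((-1, 1)) for _ in range(n_sym)]
    # Channel output x(t) = sum_j taps[j] * s(t - j) + e(t), defined for t >= len(taps) - 1
    x = [None] * n_sym
    for t in range(len(taps) - 1, n_sym):
        clean = sum(a * s[t - j] for j, a in enumerate(taps))
        x[t] = clean + rng.gauss(0.0, noise_std)
    # Pair each observation vector with the symbol it should decode
    return [((x[t], x[t - 1]), s[t]) for t in range(len(taps), n_sym)]


data = make_equalizer_data()
```

Each pair consists of the two-element observation vector and the transmitted symbol, so the set can be used directly as input and desired output for training.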

For simplicity, we trained ANFIS for only a single epoch. This corresponds to 
identifying output coefficients of fuzzy rules in a Sugeno-style fuzzy inference system 
using the least-squares method; the membership functions (MFs) are not modified. 

Figure 19.16. Optimum decision surface and boundary for a nonminimum phase 
channel characterized by $H(z) = 0.5 + z^{-1}$. (MATLAB file: optideci.m) 


A four-rule ANFIS (with two MFs on each input) after one training epoch ex- 
hibits the surface shown in Figure 19.19(a). Figure 19.19(b) illustrates how to do 
thresholding at zero; 19.19(c) is the surface after thresholding; and 19.19(d) is the 
decision boundary. It is obvious that with only a single epoch of training, ANFIS 
can construct a nonlinear decision boundary to separate P+ and P- correctly. 

If we increase the number of MFs on each input, the number of rules also 
increases and the performance should improve since we are allowing the ANFIS 
equalizer more degrees of freedom to match the given training data. Figure 19.20 
demonstrates similar plots for a nine-rule ANFIS. The performance is slightly better 
than that of the four-rule ANFIS, but we can also see some spurious decisions on 
the left-hand side of Figure 19.20(d). If we look back at Figure 19.18(a), it becomes 
clear that the wrong decisions made by ANFIS occurred in an area where training 
data is scarce. The lack of data leads to wrong predictions by the nine-rule ANFIS, 
which the four-rule ANFIS did not make because it had fewer parameters. On the 
other hand, we do need to have enough tuning parameters to match the optimal de- 
cision boundary. This is an inherent trade-off between modeling and generalization 
capabilities. 

The scarcity of data can be quantified by a data density defined by the following 
formula: 

$$
\text{density}(\mathbf{x}) = \sum_{\mathbf{d}_i \in D} \exp\left(-\frac{\|\mathbf{x} - \mathbf{d}_i\|^2}{\sigma^2}\right),
$$

where the $\mathbf{d}_i$'s are the input portions of the training data, and $\sigma$ is an effective radius. 

Figure 19.17. Signals in the training data set. (MATLAB file: equdata.m) 

Figure 19.18. Training data distribution and density; the panels include (a) training 
data distribution, (b) training data density, and (c) density contours. (MATLAB file: 
equdensi.m) 

Figure 19.19. Four-rule ANFIS equalizer. (MATLAB command: eqtrain(2)) 


After normalizing the density function to within [0,1], the data density and its 
contours are as shown in Figures 19.18(b) and 19.18(c); Figure 19.18(d) is a combined 
plot. Since the noise is Gaussian around vectors in $P_+$ and $P_-$, we can clearly 
identify the peaks of the density function and define the confidence region of the 
ANFIS equalizer as the internal area surrounded by the contour at height = 
0.05. By superimposing the contour onto the decision boundaries obtained earlier, 
we obtain Figure 19.21. It is now clear that the wrong predictions made by the 
nine-rule ANFIS are definitely outside the confidence region. In general, when 
input vectors fall outside the ANFIS confidence region, we should re-examine its 
results using information from other sources. 
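
A confidence check of this kind can be sketched in Python as follows. The example assumes a Gaussian-kernel density with the exponent scaled by $\sigma^2$; the exact scaling, the value of `sigma`, the toy data, and the shortcut of normalizing by the largest density found at the data points (rather than over a grid) are all assumptions made for illustration.

```python
import math


def density(x, data, sigma):
    """Sum of Gaussian bumps centered at the training inputs."""
    return sum(
        math.exp(-sum((xi - di) ** 2 for xi, di in zip(x, d)) / sigma ** 2)
        for d in data
    )


def in_confidence_region(x, data, sigma, peak, level=0.05):
    """True when the density at x, normalized by the peak, reaches the contour level."""
    return density(x, data, sigma) / peak >= level


# Hypothetical training inputs clustered around two channel states
data = [(1.5, 1.5), (1.4, 1.6), (1.6, 1.4), (-0.5, 0.5), (-0.4, 0.6)]
sigma = 0.5

# Normalization constant: largest density observed at the data points
peak = max(density(d, data, sigma) for d in data)

inside = in_confidence_region((1.5, 1.5), data, sigma, peak)
outside = in_confidence_region((-3.0, -3.0), data, sigma, peak)
```

An equalizer could run this check before trusting its own decision and defer to another source of information whenever the input falls outside the region.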

In summary, ANFIS is effective in solving channel equalization problems; however, 
its results should not be taken too literally, since understanding the data is always 
crucial. 

Other fuzzy modeling and neural network approaches to channel equalization 
problems can be found in refs. [2, 3, 6]. 

Figure 19.20. Nine-rule ANFIS equalizer. (MATLAB command: eqtrain(3)) 

Figure 19.21. Decision boundaries plus confidence region. (MATLAB file: 
equdec.m) 


19.7 ADAPTIVE NOISE CANCELLATION 

Adaptive noise cancellation was first proposed by Widrow and Glover in 1975 [7]; 
the objective is to filter out an interference component by identifying a linear model 
between a measurable noise source and the corresponding unmeasurable interference. 
Adaptive noise cancellation using linear filters has been used successfully in 
real-world applications such as interference canceling in electrocardiograms (ECGs), 
echo elimination on long-distance telephone transmission lines, and antenna sidelobe 
interference canceling [8]. 

Figure 19.22. Schematic diagram of noise cancellation: (a) without ANFIS filtering; 
(b) with ANFIS filtering. [Note that the inputs to blocks $f$ and ANFIS could 
contain past values of $n(k)$ not shown here.] 

It is obvious that we can expand the concept of linear adaptive noise cancella- 
tion into the nonlinear realm by using nonlinear adaptive systems. In this section, 
we shall show how ANFIS can be used to identify an unknown nonlinear passage 
dynamics that transforms a noise source into an interference component in a de- 
tected signal. Under certain conditions, the proposed approach is sometimes more 
suitable than noise elimination techniques based on frequency-selective filtering. 

Figure 19.22(a) shows the schematic diagram of an ideal situation to which 
adaptive noise cancellation can be applied. Here we have an unmeasurable infor- 
mation signal x(k) and a measurable noise source signal n(k); the noise source goes 
through unknown nonlinear dynamics to generate a distorted noise d(k), which is 
then added to $x(k)$ to form the measurable output signal $y(k)$. Our task is to 
retrieve the information signal $x(k)$ from the overall output signal $y(k)$, which consists 
of the information signal $x(k)$ plus $d(k)$, a distorted and delayed version of $n(k)$. 

An example of noise cancellation is the suppression of the maternal ECG component 
in fetal ECG [8]. Suppose that we want to measure the fetal ECG $x(k)$ during labor. 
If we record signals from a sensor placed in the abdominal region, the obtained signal 
$y(k)$ is inevitably noisy due to the mother’s heartbeat signal $n(k)$, which can be 
measured clearly via a sensor at the thoracic region. However, the heartbeat signal 
$n(k)$ does not appear directly in $y(k)$. Instead, $n(k)$ travels through the mother’s 
body and arrives delayed and distorted to appear in the overall measurement $y(k)$. 
In symbols, the detected output signal is expressed as 

$$
y(k) = x(k) + d(k) = x(k) + f(n(k), n(k-1), n(k-2), \ldots). \tag{19.1}
$$

The function $f(\cdot)$ represents the passage dynamics that the noise signal $n(k)$ goes 
through. If $f(\cdot)$ were known exactly, it would be easy to recover the original information 
signal by subtracting $d(k)$ from $y(k)$ directly. However, $f(\cdot)$ is usually 
unknown in advance and could be time varying due to changes in the environment. 
Moreover, the spectrum of $d(k)$ may overlap that of $x(k)$ substantially, invalidating 
the use of common frequency-domain filtering techniques. 

To estimate the distorted noise signal $d(k)$, we need to pick up a clean version 
of the noise signal $n(k)$ that is independent of the information signal. However, we 
cannot access the distorted noise signal $d(k)$ directly since it is an additive component 
of the overall measurable signal $y(k)$. Fortunately, as long as the information 
signal $x(k)$ is zero mean and not correlated with the noise signal $n(k)$, we can use 
the detected signal $y(k)$ as the desired output for ANFIS training, as shown in 
Figure 19.22(b). 

More specifically, let the output of ANFIS be denoted by $\hat{d}(k)$. The learning 
rule of ANFIS tries to minimize the error 

$$
\begin{aligned}
\|e(k)\|^2 &= \|y(k) - \hat{d}(k)\|^2 \\
&= \|x(k) + d(k) - \hat{d}(k)\|^2 \\
&= \|x(k) + d(k) - \hat{f}(n(k), n(k-1), n(k-2), \ldots)\|^2,
\end{aligned}
\tag{19.2}
$$

where $\hat{f}$ is the function implemented by ANFIS. Since $x(k)$ is not correlated with 
$n(k)$ or its history, ANFIS has no clue how to minimize the error component attributable 
to $x$. In other words, the information signal $x$ serves as an uncorrelated 
“noise” component in the data fitting process, so ANFIS can do nothing 
about it except pick up its steady-state trend. Instead, the best that ANFIS 
can do is to minimize the error component attributable to $d(k)$, that is, 
$\|d(k) - \hat{f}(n(k), n(k-1), n(k-2), \ldots)\|^2$, and this happens to be the desired error 
measure; it is as if we could measure $d(k)$ directly. To make this clear, we can 
expand Equation (19.2) to 

$$
\|e(k)\|^2 = \|x(k)\|^2 + \|d(k) - \hat{d}(k)\|^2 + 2x(k)d(k) - 2x(k)\hat{d}(k). \tag{19.3}
$$

Figure 19.23. Various signals for noise cancellation: (a) information signal x(k);
(b) noise signal n(k); (c) distorted noise signal d(k); (d) measurable output signal
y(k). (MATLAB file: noise1.m)

Taking expectations from both sides of Equation (19.3) and realizing that x(k) is
not correlated with d(k) yields

$$
E[e^2] = E[x^2] + E[(d - \hat{d})^2] - 2E[x\hat{d}]. \tag{19.4}
$$

If x(k) is a random signal with a zero mean, then ANFIS has no way to model it,
and $\frac{1}{n}\sum_{k=1}^{n} x(k)\hat{d}(k)$ approaches zero as n goes to infinity. This implies
$E[x\hat{d}] = 0$, and we have

$$
E[e^2] = E[x^2] + E[(d - \hat{d})^2], \tag{19.5}
$$

where $E[x^2]$ is not affected when ANFIS is adjusted to minimize $E[e^2]$. Therefore,
training ANFIS to minimize the total error $E[e^2]$ is equivalent to minimizing
$E[(d - \hat{d})^2]$, such that the ANFIS function $\hat{f}(\cdot)$ can be as close as possible to the passage
dynamics f(·) in a least-squares sense.
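The decomposition in Equation (19.5) can be checked numerically. The following Python sketch is illustrative only and is not taken from the MATLAB file cited in the captions: it assumes a uniformly distributed zero-mean x(k), a Gaussian n(k), a hypothetical distortion d(k) = sin(n(k)), and a hypothetical imperfect estimate equal to 0.8 d(k) standing in for the ANFIS output. Sample averages stand in for expectations.

```python
import math
import random


def mean(values):
    """Sample average, used as a stand-in for the expectation E[.]."""
    return sum(values) / len(values)


def error_decomposition(num_samples=20000, seed=0):
    """Return sample estimates of E[e^2], E[x^2], E[(d - d_hat)^2], E[x d_hat]."""
    rng = random.Random(seed)
    x, d, d_hat = [], [], []
    for _ in range(num_samples):
        n = rng.gauss(0.0, 1.0)           # measurable noise source n(k)
        x.append(rng.uniform(-1.0, 1.0))  # zero-mean signal, independent of n(k)
        d.append(math.sin(n))             # hypothetical distorted noise d(k)
        d_hat.append(0.8 * d[-1])         # hypothetical imperfect estimate of d(k)

    total = mean([(xi + di - dh) ** 2 for xi, di, dh in zip(x, d, d_hat)])
    signal_power = mean([xi ** 2 for xi in x])
    model_error = mean([(di - dh) ** 2 for di, dh in zip(d, d_hat)])
    cross_term = mean([xi * dh for xi, dh in zip(x, d_hat)])
    return total, signal_power, model_error, cross_term


total, signal_power, model_error, cross_term = error_decomposition()
print(f"E[e^2]                  = {total:.4f}")
print(f"E[x^2] + E[(d-d_hat)^2] = {signal_power + model_error:.4f}")
print(f"E[x d_hat]              = {cross_term:.4f}")
```

With enough samples the cross term is close to zero, so the first two printed values nearly coincide; only the model-error term depends on the quality of the estimate.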

Note that x(k) is the information we want to recover, but it also serves as 
additive “noise” in ANFIS training. To simplify the following discussion, let us 
assume that 

1. x(k) is a zero signal for all k, and

2. we have fixed premise parameters and updated consequent parameters of
ANFIS using the least-squares method.

Assumption 1 implies that we can obtain perfect training data that are subject
only to measurement noise; assumption 2 states that we are using ANFIS with
linear parameters only. Even with perfect training data, ANFIS (with modifiable
linear parameters only) would produce a fitting error e(k) equal to the difference
between a desired output and the ANFIS output; this error term is attributable
to measurement noise and/or modeling errors. Statistically, if the error term e(k)
is zero mean, then the consequent parameters ANFIS obtains via the least-squares
method are unbiased. This is a well-known property of the linear least-squares
estimator (LSE) and is stated in the Gauss-Markov theorem in Section 5.7.

Figure 19.24. Spectral density distributions: (a) information signal x(k); (b) noise
signal n(k); (c) distorted noise signal d(k); (d) measurable output signal y(k), for
k = 0 to 255. (MATLAB file: noise1.m)

We now want to relax our previous assumptions and see how well the
Gauss-Markov theorem fits. Assumption 1 states that x(k) is a zero signal, which is
unrealistic. Fortunately, however, x(k) is an additive component, and the new error
term becomes e(k) + x(k). Therefore, as long as x(k) is zero mean, we can still
identify unbiased consequent parameters using LSE.
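The unbiasedness claim can be illustrated with a small Monte Carlo experiment. The Python sketch below is a simplified stand-in, not ANFIS itself: it assumes a model that is linear in two parameters, with hypothetical Gaussian regressors phi1 and phi2 playing the role of the fixed-premise ANFIS regressors, and a uniform zero-mean term playing the role of x(k). All names are illustrative.

```python
import random


def lse_two_params(phi1, phi2, y):
    """Least-squares fit of y ~ t1*phi1 + t2*phi2 via the 2x2 normal equations."""
    s11 = sum(a * a for a in phi1)
    s12 = sum(a * b for a, b in zip(phi1, phi2))
    s22 = sum(b * b for b in phi2)
    r1 = sum(a * t for a, t in zip(phi1, y))
    r2 = sum(b * t for b, t in zip(phi2, y))
    det = s11 * s22 - s12 * s12
    return (r1 * s22 - r2 * s12) / det, (r2 * s11 - r1 * s12) / det


def average_estimate(true_theta=(1.5, -0.7), trials=200, samples=200, seed=1):
    """Average the LSE over many independent data sets."""
    rng = random.Random(seed)
    sum1 = sum2 = 0.0
    for _ in range(trials):
        phi1 = [rng.gauss(0.0, 1.0) for _ in range(samples)]
        phi2 = [rng.gauss(0.0, 1.0) for _ in range(samples)]
        # Zero-mean additive term that the linear model cannot explain
        y = [true_theta[0] * a + true_theta[1] * b + rng.uniform(-1.0, 1.0)
             for a, b in zip(phi1, phi2)]
        t1, t2 = lse_two_params(phi1, phi2, y)
        sum1 += t1
        sum2 += t2
    return sum1 / trials, sum2 / trials


print(average_estimate())
```

Each individual estimate is perturbed by the additive term, but the average over many data sets settles near the true parameters, which is what unbiasedness means here.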

Assumption 2 requires ANFIS to update its consequent parameters only. In our 
simulations, we applied the proposed hybrid learning rule to update both premise 
(nonlinear) and consequent (linear) parameters. This made ANFIS a nonlinear 
model and the Gauss-Markov theorem no longer held. However, we do possess the 
capacity to reduce modeling errors further. 

If we replace the ANFIS block in Figure 19.22(b) with a linear filter, we then
have the original adaptive linear noise cancellation setting proposed by Widrow
and Glover in 1975 [7]. By enhancing the linear filter with a nonlinear ANFIS filter,
we are able to deal with a wide range of nonlinear passage dynamics.

Figure 19.25. Using ANFIS for noise cancellation: (a) actual nonlinear passage
dynamics f(·); (b) ANFIS function $\hat{f}$; (c) training data distribution; (d) RMSE
curve. (MATLAB file: noise1.m)

Before presenting simulation results, we shall reiterate the conditions under 
which adaptive noise cancellation is valid: 

• The noise signal n(k) should be available and independent of the information 
signal x(k). 

• The information signal x(k) must be zero mean.

• The order of the passage dynamics is known. (This determines the number of 
inputs to the ANFIS filter.) 

In our experiments, we applied ANFIS to two nonlinear passage dynamics of 
orders 2 and 3, respectively. In the first experiment, the unknown nonlinear passage 
dynamics were assumed to be defined as 


$$
d(k) = f(n(k), n(k-1)) = \frac{\sin(n(k))\, n(k-1)}{1 + [n(k-1)]^2}, \tag{19.6}
$$


where n(k) is a noise source and d(k) denotes the resultant from the nonlinear
passage dynamics f(·) attributable to n(k) and n(k − 1). Figure 19.25(a) displays
function f(·) as a three-dimensional surface. Since f(·) is unknown, we use ANFIS
to approximate this function under the assumption that we do know that f(·) is of
order 2.

Figure 19.26. MFs before and after training. (MATLAB file: noise1.m)

We assume that the information signal x(k) is expressed as

x(k) = sin (rS), (19.7)

where k is a step count, and the sampling period is equal to 5 μs. Figure 19.23(a)
shows x(k) when k runs from 0 to 1,000 (or when time runs from 0 to 5 s).
We assume that the measurable noise source is Gaussian with zero mean and unity
variance, as shown in Figure 19.23(b). The resulting distorted noise d(k) produced
by the nonlinear dynamics in Equation (19.6) is shown in Figure 19.23(c). The
measurable signal at the receiving end, denoted as y(k), is equal to the sum of x(k)
and d(k) and is shown in Figure 19.23(d). Due to the nonlinear passage
dynamics of f(·) and the large amplitude of d(k), it is hard to correlate y(k) and
x(k) in the time domain.
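The signal construction just described can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the cited MATLAB file: the passage dynamics follow Equation (19.6) as printed, the noise source is unit-variance Gaussian as stated, and d(0) is set to zero because n(−1) is unavailable (an assumption). The default information signal is a placeholder sinusoid chosen only for illustration; it is not the signal of Equation (19.7).

```python
import math
import random


def passage_dynamics(n_k, n_k1):
    """Equation (19.6): d(k) = sin(n(k)) * n(k-1) / (1 + n(k-1)^2)."""
    return math.sin(n_k) * n_k1 / (1.0 + n_k1 ** 2)


def placeholder_signal(k):
    """Hypothetical zero-mean information signal (NOT Equation (19.7))."""
    return math.sin(2.0 * math.pi * k / 200.0)


def simulate(num_steps=1001, seed=0, info_signal=placeholder_signal):
    """Return the lists n(k), x(k), d(k), y(k) for k = 0 .. num_steps - 1."""
    rng = random.Random(seed)
    n = [rng.gauss(0.0, 1.0) for _ in range(num_steps)]  # noise source
    x = [info_signal(k) for k in range(num_steps)]       # information signal
    # d(0) is assumed to be zero since n(-1) does not exist
    d = [0.0] + [passage_dynamics(n[k], n[k - 1]) for k in range(1, num_steps)]
    y = [xk + dk for xk, dk in zip(x, d)]                # measurable signal
    return n, x, d, y


n, x, d, y = simulate()
print(len(y), max(abs(v) for v in d))
```

Only n(k) and y(k) would be available to the filter in practice; x(k) and d(k) are returned here so that the estimates can be compared against them.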

Before we move on, we should first examine how these signals behave in the
frequency domain. Figures 19.24(a) through 19.24(d) show the spectral
density distributions of x(k), n(k), d(k), and y(k), respectively, for the first 256
points. Obviously, the spectra of x(k) and d(k) overlap each other considerably,
making it impossible to employ frequency-domain filtering techniques to remove
d(k) from y(k).

Figure 19.27. Results of using ANFIS for noise cancellation: (a) ANFIS output
$\hat{d}(k)$; (b) estimated information signal $\hat{x}(k)$; (c) estimation error $x(k) - \hat{x}(k)$; (d)
original information signal x(k). (MATLAB file: noise1.m)

To use ANFIS in this situation, we collected 500 training data pairs of the
following form:

[n(k), n(k − 1); y(k)], (19.8)

with k running from 1 to 500. We used a four-rule ANFIS to fit the training data, in
which each of the two inputs was assigned two generalized bell membership
functions. Figure 19.25(b) is the ANFIS surface $\hat{f}(\cdot)$ after 20 epochs of batch learning;
Figure 19.25(c) is the scatter plot of the training data; Figure 19.25(d) is the RMSE
(root-mean-squared error) curve through 20 epochs. The starting point of the RMSE curve
shows the error when only the linear parameters have been identified by LSE. By
also updating the nonlinear parameters, we were able to decrease the error further;
Figure 19.26 shows the MFs before and after training, reflecting changes in premise
(nonlinear) parameters. Note that the error cannot be minimized to zero; the
minimum error is regulated by the information signal x(k), which appears as fitting
noise.
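Assembling training pairs of the form (19.8) amounts to pairing delayed noise samples with the measured output. The helper below is a minimal Python sketch with hypothetical names; the `order` argument sets the number of delayed inputs, so the same function also covers a third-order case with inputs n(k), n(k − 1), n(k − 2).

```python
def make_training_pairs(n, y, order=2, first_k=1, count=500):
    """Build ((n(k), n(k-1), ...), y(k)) pairs for k = first_k .. first_k + count - 1."""
    if first_k < order - 1:
        raise ValueError("first_k must leave room for the delayed inputs")
    if first_k + count > min(len(n), len(y)):
        raise ValueError("not enough samples for the requested range of k")
    pairs = []
    for k in range(first_k, first_k + count):
        inputs = tuple(n[k - lag] for lag in range(order))  # n(k), n(k-1), ...
        pairs.append((inputs, y[k]))                        # desired output is y(k)
    return pairs


# Toy sequences, used only to show the call and the shape of the result
n = [0.1 * k for k in range(501)]
y = [10.0 * k for k in range(501)]
pairs = make_training_pairs(n, y)
print(len(pairs), pairs[0])
```

Note that the desired output is the measurable signal y(k), not the inaccessible d(k); the argument around Equation (19.5) is what justifies this substitution.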

By using ANFIS, the estimated resultant $\hat{d}$ from the nonlinear passage is
expressed as $\hat{d}(k) = \hat{f}(n(k), n(k-1))$, as shown in Figure 19.27(a). Thus, the
estimated information signal $\hat{x}(k)$, derived as $y(k) - \hat{d}(k)$, is shown in Figure 19.27(b).
The difference between x(k) and $\hat{x}(k)$ is shown in Figure 19.27(c). Note that $\hat{x}(k)$
is already fairly close to x(k); the estimation error in Figure 19.27(c) is expected to
decrease if more training data are used over more training epochs.

In our second experiment, we used real-world audio signals for simulation.
The audio signals were obtained from the MATLAB sound files handel.mat and
chirp.mat. When these two files are loaded into MATLAB and played by the
command sound, handel.mat plays an excerpt from George Frideric Handel’s
“Hallelujah Chorus” and chirp.mat plays the sound of a bird’s chirping.

Figure 19.28. Various signals for noise cancellation: (a) information signal x(k);
(b) noise signal n(k); (c) distorted noise signal d(k); (d) measurable output signal
y(k). (MATLAB file: noise2.m)

We used handel.mat as the information signal x(k) and chirp.mat as the noise
source n(k). These audio signals were sampled at 8190 Hz. The nonlinear passage
dynamics f(·) is represented by the following equation:


$$
d(k) = f(n(k), n(k-1), n(k-2)) = \frac{8 \sin(n(k)\, n(k-1)\, n(k-2))}{1 + [n(k-1)]^2 + [n(k-2)]^2}, \tag{19.9}
$$


where d(k) is the distorted noise signal. Figure 19.28(a) shows x(k) when k
runs from 0 to 13,128 (or when time runs from 0 to 1.6 s). The measurable noise
signal n(k) is assumed to be the chirping sound, as shown in Figure 19.28(b). The
corresponding distorted noise d(k) due to the nonlinear dynamics in Equation (19.9)
is shown in Figure 19.28(c). The measurable signal at the receiving end, denoted
as y(k), is equal to the sum of x(k) and d(k) and is shown in Figure 19.28(d).

Figures 19.29(a) through 19.29(d) show the spectral density distributions for
x(k), n(k), d(k), and y(k), respectively, for the first 1,000 points. Again, the spectra
of x(k) and d(k) overlap each other considerably, and we cannot use
frequency-selective filters to remove d(k) from y(k).

Figure 19.29. Spectral density distributions: (a) information signal x(k); (b) noise
signal n(k); (c) distorted noise signal d(k); (d) measurable output signal y(k), for
k = 0 to 999. (MATLAB file: noise2.m)

Figure 19.30. RMSE curve for ANFIS learning. (MATLAB file: noise2.m)


To model f(·) using ANFIS, we collected 1,000 training data pairs of the
following form:

[n(k), n(k − 1), n(k − 2); y(k)], (19.10)

with k running from 2 to 1001. We used an eight-rule ANFIS to fit the training data,
in which each of the three inputs was assigned two generalized bell membership
functions. Figure 19.30 is the RMSE curve through 100 training epochs.

Figure 19.31. Results of using ANFIS for noise cancellation: (a) ANFIS output
$\hat{d}(k)$; (b) estimated information signal $\hat{x}(k)$; (c) estimation error $x(k) - \hat{x}(k)$; (d)
original information signal x(k). (MATLAB file: noise2.m)

By using ANFIS, the estimated output $\hat{d}$ of the nonlinear passage is expressed as
$\hat{d}(k) = \hat{f}(n(k), n(k-1), n(k-2))$, as shown in Figure 19.31(a). Thus the estimated
information signal $\hat{x}(k)$, derived as $y(k) - \hat{d}(k)$, is shown in Figure 19.31(b). The
difference between x(k) and $\hat{x}(k)$, as shown in Figure 19.31(c), was small when k
ranged from 0 to 999 since the training data were obtained from this interval. The
difference is most pronounced when k is equal to 6700 or so; improvement is expected
when we collect training data from a longer interval. Note that collecting data from
a longer interval usually generates an excessive number of training data pairs. (To
keep training time reasonably low in this case, we usually need to perform some
kind of data reduction to extract representative data pairs and weed out redundant
ones.)

By actually playing the audio signals x(k), y(k), and $\hat{x}(k)$, we found that
ANFIS did a good job of removing the unknown distorted noise signal d(k) from the
measured signal y(k).

20.1 INTRODUCTION

This chapter introduces a neuro-fuzzy model, called a fuzzy-filtered neural
network, for adaptive learning and feature detection. Its applications in the domains
of plasma spectrum analysis and pattern recognition are described. The issues
involved in using a neural network to process massive amounts of possibly
redundant input information are analyzed and solutions are suggested. In particular, we
provide three architectures for fuzzy-filtered neural networks, which employ
one-dimensional fuzzy filters, two-dimensional fuzzy filters, and genetic algorithm-based
fuzzy filters, respectively. The validity, efficiency, and generality of the proposed
models are verified by experimental results.

A considerable share of recent research on machine recognition has focused on
the use of neural networks. An important step in most existing NN-based methods
is extraction of the features of the patterns to be recognized. The goal of automatic
feature extraction is twofold: reducing the complexity of the NN
architecture and increasing system adaptability. A better feature extraction mechanism
not only leads to a higher recognition rate but also implies a simpler NN structure
in most cases. Simpler architectures have a better chance of avoiding the problem
of overfitting or overlearning, which is frequently encountered in complicated
adaptive systems.

The complexity and limitations of traditional neural networks are largely due
to the lack of an effective way to extract meaningful information from the learned
configuration. This problem becomes more intractable when the number of physical
sensors used for measurement increases. For example, an image to be processed
contains thousands of pixels, which is far more than the number usually needed for
pattern recognition. In addition, the drifting of sensory equipment and variations
in samples would cause any fixed-position feature detector to fail to recognize the
resulting altered signal. To cope with these two factors and the general problem
of background noise, a form of signal filtering is needed. In the following we shall
introduce a mechanism for fuzzy filtering to cope with the complexity of feature
extraction.

Moreover, even when a set of tightly structured fuzzy filters is used, redundant 
information is still read in. This issue cannot be handled well by simple adaptation 
because of the frequent occurrence of local optima. Consequently, we need another 
technique for global search. Genetic algorithms (GAs), as described in Section 7.2 
of Chapter 7, are a general method for optimization. GA-based fuzzy filters are
highly flexible because the size, shape, and location of the filters can be identified
separately. Of course, the cost of employing a GA must also be taken into account
if we are to achieve better performance. We shall describe an effective mechanism
to compensate for the computation cost introduced by using GAs. 

20.2 FUZZY-FILTERED NEURAL NETWORKS 

Fuzzy filtering is the task of partitioning a large number of physical data chan- 
nels into much fewer fuzzy channels. These channels, adaptable during a training 
process, are employed for both noise filtering and feature detection. The
boundary between two neighboring meaningful channels is assumed to be a continuous,
overlapping area in which a physical channel has partial membership in both fuzzy 
channels. A fuzzy channel defines a range of input signal intensity characterized by 
an appropriate membership function. The position and shape of this membership 
function are adjusted during a learning process so that the system error is mini- 
mized. At the end of training, the fuzzy channels are expected to hook onto salient 
physical channels to provide a meaningful interpretation of the qualitative aspects 
of patterns. 

Using fuzzy filters as a mechanism for feature extraction has several advantages. 
First, the features detected are insensitive to variation in samples, and the effect of 
noise is negligible. Second, duplicated features are automatically combined. Third, 
this approach considerably reduces the complexity of the architecture, so that not 
only is the training efficiency improved but the possibility of overfitting is decreased. 
Finally, this model provides a meaningful interpretation of the detected features. 

Biological evidence suggests that a positional feature detector should preferably 
behave as a localized receptive field [4]. Mathematically, localized receptive fields can 
be represented as radial basis functions, as described in Section 9.5 of Chapter 9. 
We have shown that there exists a type of functional equivalence between radial 
basis function networks and fuzzy inference systems [3]. Furthermore, localized 
receptive field-based architectures are more efficient than standard neural networks
in terms of learning [5]. The preceding background justifies the choice of a fuzzy 
neural network as a feature extraction tool for problems with a massive amount of 
sensory input. 

We use a multilayer feedforward adaptive network (Chapter 8) to implement the
concept of fuzzy filtering. Figure 20.1 depicts a fuzzy-filtered adaptive network in
which the $x_i$’s are inputs and the $y_i$’s are outputs. The nodes in the same layer
have the same type of node function.

Figure 20.1. Fuzzy-filtered neural network.

Layer 1 is the input layer. Each node in layer 2 is associated with a generalized
bell membership function:

$$
\mu_A(x_i) = \frac{1}{1 + \left|\dfrac{x_i - c_i}{a_i}\right|^{2b_i}}, \tag{20.1}
$$

where $x_i$ stands for the position (or, equivalently, frequency) of a physical channel,
A is the linguistic term associated with this node function, and $\{a_i, b_i, c_i\}$ is the
parameter set. The node output is a normalized weighted sum:

$$
\frac{\sum_i \mu_A(x_i)\, f(x_i)}{\sum_i \mu_A(x_i)}, \tag{20.2}
$$

where $f(x_i)$ is the intensity of input channel $x_i$. Thus, a fuzzy filter behaves as a
bandpass filter with the added capability of learning.
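Equations (20.1) and (20.2) translate directly into code. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: the channel position is taken to be the channel index 0, 1, ..., N − 1, and the function names `gbell` and `fuzzy_filter_output` are hypothetical.

```python
def gbell(x, a, b, c):
    """Generalized bell membership function of Equation (20.1)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2.0 * b))


def fuzzy_filter_output(intensities, a, b, c):
    """Normalized weighted sum of Equation (20.2) over all physical channels."""
    # Assumption: the position of channel i is simply its index i
    weights = [gbell(i, a, b, c) for i in range(len(intensities))]
    weighted = sum(w * f for w, f in zip(weights, intensities))
    return weighted / sum(weights)


# A toy 50-channel spectrum with a single emission line at channel 10
spectrum = [0.0] * 50
spectrum[10] = 1.0
print(fuzzy_filter_output(spectrum, a=3.0, b=2.0, c=10.0))  # filter on the line
print(fuzzy_filter_output(spectrum, a=3.0, b=2.0, c=40.0))  # filter far away
```

A filter centered on the line responds much more strongly than one centered elsewhere, which is the bandpass behavior noted above; learning then amounts to adjusting a, b, and c.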

The initial values of the MF parameters are set in a way similar to that in 
Figure 12.6 on page 345. As mentioned in Section 12.6, although these initial 
MFs are set heuristically and subjectively, they provide an easy interpretation that 
parallels the human thinking process. The parameters are then tuned using steepest 
descent in a learning process based on the training data set. 

The nodes in layer 3 perform the same function as the hidden layer in a standard 
multilayer perceptron (Section 9.4); each node takes the weighted sum of inputs 
and produces a transferred output through a sigmoidal function. Layer 4 is similar 
except that the nonlinear transfer function is not used. 

In summary, fuzzy filters can be considered a type of locally receptive field which
emphasizes local positional features. As a result, it is natural to use fuzzy filters
to divide a massive number of physical input channels into a much smaller number
of fuzzy channels. Since we do not have a priori information about the shapes and
positions of these fuzzy channels, it is desirable for them to be adaptable by training.
Consequently, they can handle variation in samples and redundant information. In
the following, we will apply the fuzzy-filtered neural networks in two domains:
plasma analysis and character recognition.


20.3 APPLICATION 1: PLASMA SPECTRUM ANALYSIS 

The importance of plasma analysis has long been asserted by both the scientific and 
the engineering communities. For problems ranging from outer space physics to 
medical diagnosis, the success of the solutions depends highly on our understanding 
of primitive but complicated information: spectral signals emitted from plasma. 

In VLSI manufacturing, for example, it is necessary to determine the endpoints 
of an etching process and to detect contamination in a chemical chamber. Existing 
recipes or rules of thumb are mostly heuristic and therefore not guaranteed to be 
reliable for all possible scenarios. Moreover, these approaches depend highly on the 
knowledge of which species (that is, spectral intensities at some frequencies) are 
indicative of the process; they cannot be used to identify the important species.
The most direct way of monitoring a chemical process should be to use the full 
range of optical signals (the spectra) generated by an optical emission spectrometer 
to determine the actual status of the chemical reaction. 

Because of drifting of equipment, clouding of the chamber’s window, and vari- 
ations in the plasma, any single-wavelength detector is likely to miss an important 
signal at a specific frequency; therefore, some type of signal filtering is needed. Fur- 
thermore, we know that the number of essential species in the chamber is much 
less than the number of channels. Taking all these factors into consideration, we 
decided to use the fuzzy-filtered neural network as a suitable model for coping with 
the complexity and uncertainty of plasma analysis. 

20.3.1 Multilayer Perceptron Approach 

For comparison, we first used a standard multilayer perceptron (MLP) to achieve the 
desired mapping. Our MLP for plasma analysis is a three-layer (731-8-4) network 
with linear outputs. It examines 731 optical channels and uses backpropagation to 
adjust the weights so that the inputs are associated with four control variables of an 
oxide etching process: power, chamber pressure, and two gas flows (H₂ and CF₃).
The training data set was generated according to the center cubic experimental
design. We recorded the plasma’s spectral data while the etching process was working
as desired. The differences across spectra are so subtle that a process engineer cannot
detect them.

Due to the large number of inputs (731), we had to be careful in selecting the initial
values of the weights connecting the input and hidden layers. If these weights are
even moderate in size, some of the hidden units might be driven into saturation,
making training sluggish. On the other hand, if these weights are too small, then
training is also slow due to a small gradient vector. Meanwhile, compensatory learning
rates are set to match the small weights resulting from the large number of input
channels so that fast convergence can still be achieved. With carefully chosen initial
weights and learning rates, the MLP can effectively identify small signal changes in
a noisy environment. After 200 training epochs on 30 spectra, the root-mean-square
error was driven to below 0.3%. The preceding analysis can be applied to an on-line
training situation.

20.3.2 Fuzzy-Filtered Neural Network Approach

Although a careful selection of initial weights can overcome the saturation problem 
introduced by a massive number of input channels, it is not the best solution. The
fact is that we do not need to use all 731 inputs in the first place.
We should be able to single out the important inputs and combine the redundant 
ones. Furthermore, the standard MLP architecture is not able to tell us which 
input channels (species) are important in analyzing this chemical process because 
the weight distribution in the trained network does not show any clear patterns. 
Thus, we want to employ the fuzzy-filtered neural network to achieve three goals 
at once. First, we want to improve the learning performance. Second, we need to 
avoid possible overfitting due to a limited number of training data because etching 
experiments are expensive. Third, we aim to provide a meaningful interpretation 
of the trained network. 

The fuzzy filtering mechanism described in Section 20.2 simplifies the
network architecture because far fewer system parameters need to be adjusted. The
731-8-4 network mentioned in the previous subsection has about 731 × 8 + 8 × 4 =
5880 weights to fine-tune; in comparison, a 731-15-15-4 fuzzy-filtered network has
only about 3 × 15 + 15 × 15 + 15 × 4 = 330 parameters. Obviously, this benefits learning
efficiency and reduces the chance of overfitting. A more important point is that the 
fuzzy approach provides a meaningful interpretation for the training results so that 
we can obtain a better understanding of the complex nature of plasma emission. 
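The two parameter counts quoted above can be reproduced with a short calculation. The Python sketch below uses hypothetical helper names and, like the approximate figures in the text, ignores bias terms; it assumes three parameters (a, b, c) per fuzzy filter, as in Equation (20.1).

```python
def mlp_weight_count(layer_sizes):
    """Number of connection weights in a fully connected feedforward network."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def fuzzy_filtered_param_count(num_filters, hidden, outputs, params_per_filter=3):
    """Filter parameters plus the weights of the two fully connected stages."""
    return params_per_filter * num_filters + num_filters * hidden + hidden * outputs


print(mlp_weight_count([731, 8, 4]))          # 5880
print(fuzzy_filtered_param_count(15, 15, 4))  # 330
```

The count for the fuzzy-filtered network does not depend on the 731 physical channels at all, because each filter needs only its three membership-function parameters regardless of how many channels it spans.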

A spectrum is shown in Figure 20.2, with the x-axis representing the optical
channel number, which corresponds to the wavelength of a certain species, and the
y-axis representing the (scaled) intensity of the plasma emission signal. As mentioned
previously, we employed a 731-15-15-4 network to handle the data. Three channels,
among the initial 15 channels, were lost in the learning process (driven out of the
wavelength range); the remaining ones are shown in the figure.

For an etching process, the important channels are typically those that have the 
highest intensities or change the most dramatically as the etching proceeds. We can 
match the wavelengths to the species that emit them. The fuzzy channels automat- 
ically identified two hydrogen peaks and a CO emission line (see Figure 20.2). The 
nitrogen lines indicate a possible leak in the vacuum system. 

In brief, the network learns on its own from the training data and selects the
most pertinent channels without any guidance from human experts on plasma
diagnostics. The ability to detect features is one of the most important advantages of
the fuzzy filtering mechanism. This capability is significant because human experts
can actually learn from the network and gain a better understanding of the plasma
discharge and thus develop better manufacturing recipes.

Figure 20.2. A typical spectrum and fuzzy channels after training. The three
values associated with each membership function are a, b, and c in Equation (20.1),
respectively.


20.4 APPLICATION 2: HAND-WRITTEN NUMERAL RECOGNITION

To verify the generality of the fuzzy-filtered NN model, we applied it to another 
problem: hand-written numeral recognition. We used a numeral data set in the
public domain as a benchmark. This set contained 3471 numerals from 49 different 
writers. The numerals have been spatially normalized to a 32 x 32 frame of pixels 
and shifted to the left. We used 3000 numerals for training and the remaining 471 for 
testing. Most of the testing numerals were written by persons different from those 
who wrote the numerals in the training set. This testing arrangement accurately 
simulates real applications, such as zip-code recognition. Figure 20.3 shows some 
examples of the numerals in the testing set. 

For comparison, we initially tested the data set on a standard three-layer
(1024-30-10) feedforward neural network. It examined 3000 training numerals and used
backpropagation to adjust the weights. After 250 training epochs, we tested it with
the remaining 471 numerals. The recognition rate was 85%. Further, because of
the network’s complicated architecture, its learning efficiency was poor.

Figure 20.3. Some examples of hand-written numerals.

Figure 20.4. The x-axis and y-axis spectra of some training images. Upper row,
left to right: the x-axis spectra of digit 0 to digit 9; lower row: the y-axis spectra of
the same digits.


20.4.1 One-Dimensional Fuzzy Filters

To use the fuzzy filtering mechanism described in the previous section, we have to
transform the numerals to spectrum-like signals by taking projections on both the
x-axis and y-axis. In other words, each 32 × 32 image must be transformed to a
64-channel spectrum with each channel representing the number of black pixels in
a column or in a row of the image. Figure 20.4 shows the x-axis and y-axis spectra
of some training images. We used a 64-20-10-10 fuzzy-filtered network to learn the
training patterns. The recognition rate was 90%, a better result than that yielded
by the pure neural network.
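The projection step can be sketched as follows. This minimal Python example assumes that an image is stored as a list of rows with 1 for a black pixel and 0 for a white pixel, and that the 32 column counts (x-axis projection) are placed before the 32 row counts (y-axis projection); both the representation and the ordering are assumptions made for illustration.

```python
def projection_spectrum(image):
    """Concatenate black-pixel counts per column and per row of a binary image."""
    num_rows = len(image)
    num_cols = len(image[0])
    column_counts = [sum(image[r][c] for r in range(num_rows))
                     for c in range(num_cols)]   # projection on the x-axis
    row_counts = [sum(row) for row in image]     # projection on the y-axis
    return column_counts + row_counts


# Toy image: a vertical stroke in column 5, loosely resembling the digit 1
image = [[0] * 32 for _ in range(32)]
for r in range(32):
    image[r][5] = 1
spectrum = projection_spectrum(image)
print(len(spectrum), spectrum[5], spectrum[32])
```

The resulting 64-channel list can then be fed to one-dimensional fuzzy filters in the same way as an optical emission spectrum.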
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Figure 20.5. A two-dimensional fuzzy-filtered neural network. 

20.4.2 Two-Dimensional Fuzzy Filters

Because the nature of image patterns is different from one-dimensional spectra, such 
as those used in plasma analysis, it is desirable to have fuzzy filters which handle 
two-dimensional data appropriately. It is easy to generalize the fuzzy-filtered NN 
model to a 2-D version, as shown in Figure 20.5. In this architecture, x-membership 
and y-membership are integrated by a T-norm operator (e.g., multiplication). The
combined membership is then used as the filtering weight. 
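A two-dimensional filter of this kind can be sketched by multiplying an x-membership and a y-membership at every pixel. In the Python example below, multiplication is used as the T-norm, generalized bell functions are used for both memberships, and the output is normalized in the same way as in Equation (20.2); the normalization and all names are assumptions for illustration.

```python
def gbell(x, a, b, c):
    """Generalized bell membership function, as in Equation (20.1)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2.0 * b))


def fuzzy_filter_2d(image, x_params, y_params):
    """Filter a 2-D image with weight mu_x(column) * mu_y(row) at each pixel."""
    numerator = 0.0
    denominator = 0.0
    for row_index, row in enumerate(image):
        mu_y = gbell(row_index, *y_params)
        for col_index, pixel in enumerate(row):
            weight = gbell(col_index, *x_params) * mu_y  # product T-norm
            numerator += weight * pixel
            denominator += weight
    return numerator / denominator


# Toy image with one black pixel at row 8, column 20
image = [[0.0] * 32 for _ in range(32)]
image[8][20] = 1.0
print(fuzzy_filter_2d(image, (3.0, 2.0, 20.0), (3.0, 2.0, 8.0)))  # centered on it
print(fuzzy_filter_2d(image, (3.0, 2.0, 5.0), (3.0, 2.0, 25.0)))  # centered elsewhere
```

Each filter is described by six parameters, three per axis, so a modest number of filters still gives far fewer adjustable parameters than a fully connected input layer over 1024 pixels.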

The initial distribution of membership functions is represented by the grids in 
Figure 20.6(a). After training, the grids may be adjusted to a distribution like that 
shown in Figure 20.6(b) to capture the important pieces of information. We used 36 
two-dimensional fuzzy filters in the experiment. The recognition rate was increased 
to 92%, compared with the rate of 90% obtained using 40 one-dimensional filters. 

The learning efficiency was improved. However, because the fuzzy filters are
arranged as an array, the degree of freedom in terms of adaptability of one filter
is constrained by its neighbors. In other words, the positions of these fuzzy filters
are fixed to a certain extent because it is difficult for the gradient descent learning
algorithm to overcome local optimal points. Apparently, we need a more flexible
mechanism to identify shape and position in complex applications. In the next
section, we describe an advanced fuzzy filter model based on genetic algorithms
that achieves better performance without sacrificing learning efficiency.

Figure 20.6. 2-D fuzzy filters: (a) before training; (b) after training.


20.5 GENETIC ALGORITHM-BASED FUZZY FILTERS 

Genetic algorithms have been used in classification problems because of their ability 
to identify the weights of importance among features. GAs have been employed for 
feature selection in hybrid models with K-nearest neighbor algorithms [6] or with 
feature partitioning [1]. In this section we explore the possibility of integrating GAs 
with the fuzzy filtering model. 

For pattern recognition problems, the important positional features are typically 
those that have the highest or the lowest intensity and change the most dramatically 
across different numerals. As in the previous sections, we want to use only this kind 
of low-level, positional feature because for the machine to learn in a realistic sense, 
high-level human involvement should be avoided. 

In this study, various methods were developed for this purpose. Most of them 
achieved high recognition rates. We found that many factors affected performance 
in terms of recognition rate. In the following, we will describe all the methods we 
have tried and make some comparisons. 


20.5.1 A General Model 

In pursuing higher flexibility, we employed GAs to determine a set of rectangular 
regions; each will later be transformed into a fuzzy filter for each digit. We want 
these regions to represent a particular digit and not to duplicate the regions for other 
digits. In other words, we expect these regions to catch the positional features of 
a digit. The selection procedure of this GA was guided by the following evaluation 



544 


Fuzzy-Filtered Neural Networks Ch. 20 


□ D g 

~fao n 

□ 

□ 

n 

=! □ 

m 

u Dn 

^ □ 
HQ- 
r 

□ 

□ [jtjij 

0 

i 


2 


j 

4 

- 

□ 

Etr^ 


cP □ 

□ 1 


□ □ 

□ p 

□ nod 

nn 


5 6 7 8 9 


Figure 20.7. Rectangular feature detectors. 
where n stands for a numeral, r stands for a region, b(r) is the number of black
pixels in a region, and w(r) is the number of white pixels in that region. This
formula says that to be a discriminant feature a region should have a high density
of black pixels for one digit and low density for most other digits. Because we do not
want the rectangles to overlap each other too much, we introduce the denominator
term z, which is defined as one plus the total number of overlapped pixels of the
rectangles represented by a chromosome string. In other words, the higher the
degree of overlapping, the lower the evaluation score.
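The overlap penalty and the density idea can be illustrated with a short Python sketch. It is not the original implementation: the function names, the (x, y, width, height) rectangle tuple, the reading of "overlapped pixels" as pixels covered by more than one rectangle, and the guard against a zero white-pixel count are all assumptions made for illustration. Only the density term b(r) / (w(r)^2 z) is computed.

```python
import numpy as np

def rect_mask(shape, x, y, width, height):
    """Boolean mask that is True inside the rectangle."""
    mask = np.zeros(shape, dtype=bool)
    mask[y:y + height, x:x + width] = True
    return mask

def overlap_term(shape, rects):
    """z = 1 + number of pixels covered by more than one rectangle (assumed reading)."""
    cover = np.zeros(shape, dtype=int)
    for rect in rects:
        cover += rect_mask(shape, *rect)
    return 1 + int(np.count_nonzero(cover > 1))

def density_term(image, rect, z):
    """b(r) / (w(r)^2 * z) for a binary image whose black pixels are 1."""
    region = image[rect_mask(image.shape, *rect)]
    black = int(region.sum())
    white = region.size - black
    # max(white, 1) avoids division by zero; this guard is an assumption.
    return black / (max(white, 1) ** 2 * z)
```

A region that is mostly black for one digit gets a large score, and every additional overlapped pixel enlarges z and shrinks the score.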

We grouped together all x coordinates of the lower-left corners of the rectangles 
as the first section of the chromosome. The sections of y coordinates, heights, and 
widths follow in that order, as shown in Figure 20.9(a). After 400 generations, with 
a crossover rate of 0.005 per bit and a mutation rate of 0.005, the GA converged. 
It produced 10 rectangular positional feature detectors for each digit, as shown in 
Figure 20.7. 

We then fuzzified the rectangles by using two Gaussian functions to approximate 
the wall and the base of each rectangle. In other words, we created a membership 
function with its mean at the middle point of one side and with its deviation equal 
to half of the height or the width of a certain rectangle. Finally, we put these 
GA-generated fuzzy feature detectors in front of a standard neural network for 
classification training. The mechanism of this integrated GA-fuzzy-neural approach 
is shown in Figure 20.8. 
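As a rough illustration of the fuzzification step, the sketch below turns one rectangle into a two-dimensional fuzzy filter. It assumes that the two Gaussians are centered at the midpoints of the rectangle's sides, that each deviation is half the corresponding side length, and that the two are combined by a product; the combination rule, the normalized response, and all names are assumptions rather than details given in the text.

```python
import numpy as np

def gaussian(u, mean, sigma):
    """Gaussian membership value with peak 1 at `mean`."""
    return np.exp(-0.5 * ((u - mean) / sigma) ** 2)

def fuzzy_filter(shape, x, y, width, height):
    """Fuzzified rectangle: product of a Gaussian along x and one along y."""
    rows, cols = shape
    mu_x = gaussian(np.arange(cols), x + width / 2.0, width / 2.0)
    mu_y = gaussian(np.arange(rows), y + height / 2.0, height / 2.0)
    return np.outer(mu_y, mu_x)

def filter_response(image, filt):
    """Membership-weighted average intensity, usable as one NN input."""
    return float(np.sum(image * filt) / np.sum(filt))
```

Because the membership value decays smoothly outside the rectangle, small shifts of a stroke change the response gradually instead of abruptly.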

We found that the preceding architecture gave us the best performance up to 
this point (i.e., a 95.8% recognition rate). In other words, of the 471 numerals in 
the testing set, only 20 were misclassified. 

Figure 20.8. GA-determined fuzzy-filtered neural network. This figure shows a
two-phase training procedure. The numerals are first used to identify 100 regions 
by using 10 GAs, one for each digit. After that, we train the GA-determined fuzzy- 
filtered NN by using the same numerals to fine-tune the parameters. 


20.5.2 Variations and Discussion 

Since there are many design parameters for GAs, we developed several GA versions
to optimize the fuzzy-filtered NN model and compared their performance. In the 
following, we will describe the design motivation behind these variations of the basic 
model as well as the mechanisms they employed, including gene encoding schemes 
and choice of evaluation functions. 

We tried two gene encoding schemes for this model, as shown in Figure 20.9. 
The first scheme is the one discussed in the previous subsection. In the second 
scheme we have the four attributes (i.e., x coordinate, y coordinate, height, and 
width of a rectangle) grouped together as a unit. Table 20.1 shows the number 
of misclassified numerals of each digit. Some of the misclassified patterns of one 
scheme do not appear in the other schemes. 

Although the second scheme seems more natural for encoding individual rectangles,
we found that the first one yielded better performance. A possible reason
is that diversity of feature locations is at least as important as individual features.

Figure 20.9. Gene encoding schemes. (a) All x coordinates of the lower-left corners
of the rectangles are grouped together, and then the y coordinates, the heights, and
the widths. (b) The four features of a rectangle (its x coordinate, y coordinate,
height, and width) are grouped together.

Table 20.1. Number of misclassified numerals when different gene encoding 
schemes were used. 


Digit 
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8 

5 

2 

7 
The first encoding scheme takes this into account better than the other two because
of the one-point crossover mechanism that we used. 
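The difference between the two layouts can be made concrete with a small Python sketch. It uses integer genes instead of the bit strings used in the study, and the function names and the (x, y, height, width) tuple order are assumptions chosen for illustration.

```python
def encode_grouped(rects):
    """Scheme (a): all x's, then all y's, then all heights, then all widths."""
    xs, ys, hs, ws = zip(*rects)
    return list(xs) + list(ys) + list(hs) + list(ws)

def encode_interleaved(rects):
    """Scheme (b): the four attributes of each rectangle stay together."""
    return [value for rect in rects for value in rect]

def decode_grouped(chrom):
    """Inverse of encode_grouped."""
    k = len(chrom) // 4
    return list(zip(chrom[:k], chrom[k:2 * k], chrom[2 * k:3 * k], chrom[3 * k:]))

def one_point_crossover(a, b, point):
    """Swap the tails of two parents after position `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]
```

With the grouped layout, a cut inside the x section gives a child x coordinates from both parents, so rectangle locations are recombined. With the interleaved layout, a cut mostly exchanges whole rectangles and leaves their locations unchanged.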

In this study, we tried several evaluation functions to guide the selection pro- 
cedure for GAs before we found the one described in Equation (20.3). Different 
evaluation functions, as expected, resulted in different filter patterns, which in turn 
affected the overall performance. For instance, when we employed only the first 
term in Equation (20.3) as our evaluation function, which is 


$$\frac{b(r)}{w(r)^2 \times z}, \tag{20.4}$$


we obtained the set of fuzzy filters illustrated in Figure 20.10. This set of fuzzy
filters looks similar to the written numerals. However, its recognition rate is lower
than that produced by the set shown in Figure 20.7. The reason for this is
straightforward: When we consider positional features, the existence of pixels does
not necessarily imply discrimination between different digits.

Since our model is a GA-NN combination, we also tried a hybrid training
procedure. The learning cycle is illustrated in Figure 20.11. In other words, after the
GA has selected regions, we construct a fuzzy-filtered neural network to finish the
training process. Consequently, the NN training error can be used as an evaluation
score. The smaller the training error, the higher the evaluation score. This approach
looks natural and complete; however, it is very time-consuming, and, surprisingly,
it did not yield better results despite the extra time cost.

Figure 20.10. Another set of rectangular feature detectors.

Figure 20.11. Hybrid GA-NN learning cycle. The original idea comes from
ref. [2].

Figure 20.12. Superimposed image: Digit 0 (358 written numerals) represented by
the intensity value on each position.

Another important benefit of the proposed model is that with a simple data- 
preprocessing technique we can dramatically improve the training efficiency of GAs, 
which has traditionally been poor and has frequently been criticized as a primary 
drawback of GA-based optimization. The reason we can do this is that we can use 
superimposed numeral images instead of feeding the GA individual numerals one by 
one. Figure 20.12 shows a superimposed image of all the 0 numerals in our training 
set as an example. The GA in our study needs only about 30 seconds to converge 
for 400 generations on a Sun Sparc-10 workstation. This justifies the incorporation 
of GAs into our integrated model for their flexibility. 
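The saving from superimposition can be sketched as follows. The averaging (rather than summing) of images and the helper names are assumptions; the point is only that one pass over a superimposed image replaces a pass over every individual numeral.

```python
import numpy as np

def superimpose(images):
    """Average a stack of binary images into one intensity map in [0, 1]."""
    stack = np.asarray(images, dtype=float)
    return stack.mean(axis=0)

def region_density(image, x, y, width, height):
    """Mean intensity inside a rectangular region."""
    return float(image[y:y + height, x:x + width].mean())
```

For a fixed region, the density measured on the superimposed image equals the average of the densities measured on the individual images, so the GA can score a candidate rectangle with a single lookup per digit.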

The last point concerns the number and size of regions. We used 10 regions 
for each digit, which is good enough for this application. All the aforementioned 
methods use a variable rectangle size with a minimal value of 5x5. Our evalua- 
tion functions prefer small-sized rectangles because they tend to have higher pixel 
density. As a result, we put a lower bound on the size of the rectangles because it 
needs to be large enough to extract meaningful features. Generally speaking, when 
the difference between numerals of a digit is larger, we need larger regions for that 
digit. However, larger regions reduce the difference across digits. Thus, determining 
a proper lower bound for region size involves a design trade-off. 

20.6 SUMMARY

The banner of fuzzy logic, as pointed out by its inventor, Professor Lotfi Zadeh, 
is to exploit the tolerance for imprecision [7]. Because high precision entails high 
cost and low tractability, reasonable solutions to problems encountered in daily life 
usually employ knowledge at a compromised degree of precision granularity. In this 
sense, feature detection can be defined as an effort to extract essential attributes 
from a massive amount of information so that a pattern recognition problem can 
be solved efficiently. 

This chapter introduces a general methodology for adaptive feature extraction. 
The neuro-fuzzy model, called a fuzzy-filtered neural network, described here is not 
only capable of learning and adapting to variations in training samples, it also 
identifies pertinent features effectively. 

We first investigate the use of the proposed model to monitor a plasma environ- 
ment through optical signals. Simulations on experimental spectra substantiate the 
effectiveness of fuzzy channels. The location and shape of membership functions 
provide new insight into the complicated chemical reaction. This helps give us an 
idea of what wavelengths are of the greatest interest. Once the critical wavelengths 
are identified, further automation goals become achievable, such as endpointing, 
contamination monitoring, and process control. 

We also validate the generality of the fuzzy-filtered neural network by using it to 
identify important positional features for recognition of hand-written numerals. We 
explore three different architectures: one-dimensional fuzzy filters, two-dimensional 
fuzzy filters, and GA-based fuzzy filters. Again, experiments on a large-scale data 
set validate the effectiveness of the model. In this case the location and shape of 
membership functions are used to identify positional features for pattern recognition 
problems with a very low level of human involvement. In both applications our 
model gives us an idea of what kinds of features the machine could use to simulate 
human decision making. Once the features are identified, neural network learning
becomes much easier and more efficient.

This technique can also be used for pattern analysis in other fields, such as med- 
ical diagnosis or global change, in which the explanation of the detected features 
plays a role as important as the applicability of the working model. In brief, compli- 
cated patterns are not self-explanatory; they require a good deal of interpretation. 
Fuzzy-filtered neural networks successfully serve this purpose. 

In summary, the network described here actually learns on its own from the 
training data and selects the most pertinent positional features without any high- 
level guidance from human experts. This ability to detect features is the primary 
advantage of the fuzzy filtering mechanism. This capability is significant, because 
human experts can actually learn from the network and gain a better understanding 
of the problem it treats and thus produce better solutions. 

The primary limitation of this model is that it handles positional features exclu- 
sively. This model treats inputs as a multichannel spectrum, regardless of whether 
the spectrum is of single dimension or multiple dimensions. This method is very 
effective for simple pattern recognition. However, for the task of analyzing com-
plicated patterns, such as in Chinese character recognition, this method is less 
effective. In this type of task, it is preferable to use our model as a perception-level, 
subsymbolic module in conjunction with other human expertise-intensive models. 

21.1 INTRODUCTION

This chapter describes two approaches to integrating fuzzy set theory and genetic 
algorithms and explores their application to game playing. The proposed models 
indicate a promising direction for adaptation in a changing environment. We first 
discuss the realization of diversified selection by employing multiple coaches in a 
game-playing program with a genetic algorithm-based learning module. We show 
that when several coaches are used, the collective learning result is better than when 
only a single coach is involved, regardless of the ability of the coach. 

Next, we introduce two ways of incorporating fuzzy set theory into GAs. At 
different stages of a game, players usually focus on different features, which are 
used in static evaluation of board configurations. Thus, characterizing the features 
by means of membership functions is a natural extension of the basic model. This 
extension leads to better performance, as expected. Moreover, membership func- 
tions based on human expertise are not sufficient to cope with situations in an 
ever-changing environment. Consequently, we introduce the concept of fuzzy stages 
as an alternative solution. We integrate fuzzily divided game-playing stages with 
a technique called genetic structural expansion by employing a multistage chromo- 
some coding scheme in a genetic algorithm. In particular, this chapter compares 
three chromosome coding schemes (haploidy; triploidy, a special case of polyploidy;
and structural expansion) and discusses their impact on multiple fuzzy-stage
game-playing strategies.

21.2 VARIANTS OF GENETIC ALGORITHMS 

As mentioned in Section 7.2 of Chapter 7, genetic algorithms (GAs) were invented 
to simulate evolutionary processes observed in nature so that the goal of survival 
or optimization in a changing environment could be achieved. A GA manipulates
chromosomes which encode a set of parameters of a target system to be optimized. 
A GA uses three operators (selection or reproduction, mutation, and crossover)
to achieve the goal of evolution. The single piece of information a GA receives
from the environment is a scalar indicator that evaluates the performance of each 
chromosome. The GA then uses that evaluation to bias the selection of chromosomes 
so that those with better scores tend to reproduce more often than those with worse 
scores. In addition, GAs use mutation and crossover to create children that differ 
from their parents. 
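The three operators fit in a few lines of Python. The following is a minimal sketch on bit-string chromosomes, not the program used in this chapter; the parameter values, the function name run_ga, and the toy fitness function in the usage note are assumptions.

```python
import random

def run_ga(fitness, n_bits=16, pop_size=30, generations=40,
           p_cross=0.7, p_mut=0.01, seed=0):
    """Evolve bit strings; `fitness` must return a nonnegative score."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        # Selection: better scores are sampled more often. The small constant
        # keeps the weights valid if every score is zero.
        parents = rng.choices(pop, weights=[s + 1e-9 for s in scores], k=pop_size)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < p_cross:
                # One-point crossover: swap the tails after a random cut.
                cut = rng.randrange(1, n_bits)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]
        # Mutation: flip each bit with a small probability.
        pop = [[bit ^ (rng.random() < p_mut) for bit in c] for c in children]
    return max(pop, key=fitness)
```

Calling run_ga(sum) searches for the string with the most 1 bits; the only feedback the algorithm receives is the scalar returned by the fitness function.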

It is well known that diversity helps a population survive under changing en- 
vironmental conditions, both in nature and in evolutionary computation. In other 
words, a successful evolutionary strategy should force chromosomes to exhibit di- 
versity so that the evolutionary process will not suffer from early saturation brought 
on by uniformity. In brief, it is important to keep the gene pot boiling to achieve 
the goal of continual evolution. Although this issue has long been addressed by 
researchers, very few satisfactory strategies have been suggested because of the 
difficulty of achieving a balance between fitness and diversity. 

The simplest way to achieve diversity is by increasing mutation rates or by in- 
troducing more radical mutation operators, such as the partial complement operator 
discussed in ref. [3]. It has been found, however, that this approach usually results 
in poor performance. Increasing the population size to compensate for this side 
effect does not work well either, because it results in poor learning efficiency. Other 
mutation mechanisms include random immigrants and triggered hypermutation [1]. 

Another method of enhancing diversity is to consider the distribution of indi- 
viduals as an evaluation factor in the selection process; see ref. [5] for an example 
of such a method. This concept was also employed in the rank-space method sug- 
gested by Patrick Winston in his AI textbook [15]. However, it is generally difficult 
to define distance adequately in a multidimensional search space. Recently, a sys- 
tematic discussion of disruptive selection has been presented that provides us with a 
better understanding of the behavior of a nonmonotonic fitness function [8]. A GA 
employing disruptive selection can be used successfully to solve problems such as 
optimizing a needle-in-a-haystack function, which is traditionally GA-hard.
Nevertheless, in general, we cannot establish a balance between fitness and diversity
merely by selecting some of the most unfit members.

A third approach was pioneered by Goldberg and Smith’s seminal work [4], in 
which chromosome structures are expanded to meet new challenges from the envi- 
ronment. Goldberg and Smith investigated the genetic mechanisms of diploidy and 
dominance and their effects on tracking the optimum chromosome of a changing 
environment. In this chapter, we follow in their steps by employing fuzzified poly- 
ploid chromosomes in response to different stages of a dynamic problem, such as 
game playing. 

A topic related to polyploidy is chromosome structure expansion (i.e., expansion
aimed at developing more complicated gene structures to meet varying conditions
in the environment, exploiting a certain niche, or achieving co-evolution [3]).
Although structural expansion has been considered as a means of increasing diversity
to benefit the GA optimization process, it has seldom been studied in the context
of a multistage reinforced environment.

It is natural to apply different strategies at different stages of a problem. In most 
cases, stages in a dynamic system overlap each other (i.e., the boundaries between 
them are fuzzy rather than crisp). Consequently, we use membership functions [17] 
to characterize different stages in a temporally reinforced environment. Examples 
in this chapter are open game, midgame, and end-play in game playing. Thus,
different strategies, encoded in polyploid chromosomes, should be integrated through
a fuzzy combination scheme to provide a smooth transition between stages. Ever 
since Karr’s work on employing genetic algorithms to determine fuzzy rules [7], re- 
searchers have been trying to integrate these two paradigms, in different ways, to 
obtain optimal solutions and maintain flexibility at the same time [2, 11, 12]. Here 
we propose another way of applying fuzzy logic in evolutionary computation. 


21.3 USING GENETIC ALGORITHMS IN GAME PLAYING 

We test the combined concept of fuzzy set theory and GAs in the domain of Othello,
which is a challenging game for human players because of the difficulty of envisioning 
the drastic board changes that result from moves. On the other hand, this game is 
of reasonable complexity for the task of learning because of its moderate branching 
factor. World-class Othello programs have been developed by Rosenbloom [13] 
and Lee and Mahajan [10], who applied techniques such as iterative deepening, 
move ordering, pattern classification, and Bayesian learning in their work. Their 
approach requires a great deal of human expertise for both game-playing strategies 
and learning mechanisms. The goal of this chapter, on the other hand, is to use fuzzy 
set theory and GAs to improve the learning capability of game-playing programs. 

Before explaining how GAs can be applied to the adaptation of game-playing 
strategies, we shall briefly describe the domain of Othello. The game is played by 
two players, Black and White, on an 8 by 8 board, which is initially set up as shown 
in Figure 21.1(a). Black starts the game by placing a black piece on any empty 
square on the board adjacent to one or more of White’s pieces. By this move, Black 
captures the adjoining white pieces, which are then flipped over to show their black 
side. Figures 21.1(b) and 21.1(c) show an example. There are two restrictions, 
however: (1) One of the bracketing pieces must be the piece just placed on the 
board, and (2) a move must flip at least one of the opponent’s pieces. When a 
player does not have any legal moves, he or she loses a turn. The players take turns 
placing pieces on the board until neither player can make another move. The player 
with the most pieces on the board is then declared the winner. (Refer to ref. [9] for 
further details.) 
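The capture rule can be sketched in Python as follows. This is an illustration only: it assumes the standard starting position (taken to match Figure 21.1(a)), a board indexed as board[row][col] with column 0 as file a and row 0 as rank 1, and function names of our own choosing.

```python
EMPTY, BLACK, WHITE = 0, 1, -1
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def initial_board():
    """Standard Othello starting position (assumed)."""
    board = [[EMPTY] * 8 for _ in range(8)]
    board[3][3], board[4][4] = WHITE, WHITE
    board[3][4], board[4][3] = BLACK, BLACK
    return board

def flips(board, row, col, player):
    """Opponent pieces bracketed if `player` places a piece at (row, col)."""
    if board[row][col] != EMPTY:
        return []
    captured = []
    for dr, dc in DIRS:
        run = []
        r, c = row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == -player:
            run.append((r, c))
            r, c = r + dr, c + dc
        # The run counts only if it ends on one of the player's own pieces.
        if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            captured.extend(run)
    return captured

def legal_moves(board, player):
    """A move is legal only if it flips at least one opposing piece."""
    return [(r, c) for r in range(8) for c in range(8) if flips(board, r, c, player)]

def play(board, row, col, player):
    """Return a new board after placing a piece and flipping the captures."""
    new = [line[:] for line in board]
    for r, c in flips(board, row, col, player) + [(row, col)]:
        new[r][c] = player
    return new
```

Under these conventions, square e6 is (row 5, col 4), and playing it as Black from the starting position flips the single white piece on e5.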

Figure 21.1. Game of Othello. (a) The initial setup. After Black plays to e6 on
(b), the board configuration changes to (c).

Most successful game-playing programs employ a heuristic search that uses a
static evaluation function to guide the direction of the search. A typical linear
evaluation function has the following form:

$$h = \sum_{i=1}^{n} w_{f_i} f_i, \tag{21.1}$$

where $h$ denotes the static evaluation function of a game board configuration, the
$f_i$'s are features, $n$ of them in total, that play important roles in game-playing
strategies, and the $w_{f_i}$'s are the corresponding weights that indicate the relative importance
of the features. With such an evaluation function, we can apply the well-known 
minimax search algorithm as well as alpha-beta pruning techniques. The power of 
a game-playing program is thus determined by two factors: how the discriminating 
features are selected and how the weights are assigned. These two factors have been 
the focus of a great deal of research since Arthur Samuel published his seminal work 
on machine learning [14]. In this chapter we concentrate on the second factor by 
using genetic algorithms. 
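To show how a linear evaluation function plugs into minimax search with alpha-beta pruning, here is a minimal Python sketch on a hand-made game tree. The tree, the feature vectors, the weights, and the callback-style interface are all invented for illustration and do not come from the Othello programs described here.

```python
def evaluate(weights, features):
    """Linear static evaluation: the weighted sum of the feature values."""
    return sum(w * f for w, f in zip(weights, features))

def alphabeta(node, depth, alpha, beta, maximizing, children, features, weights):
    """Minimax value of `node`, skipping branches that cannot matter."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(weights, features(node))
    if maximizing:
        best = float("-inf")
        for kid in kids:
            best = max(best, alphabeta(kid, depth - 1, alpha, beta, False,
                                       children, features, weights))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizing player will never allow this branch
        return best
    best = float("inf")
    for kid in kids:
        best = min(best, alphabeta(kid, depth - 1, alpha, beta, True,
                                   children, features, weights))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizing player already has a better option
    return best

# Toy two-ply tree with two features per leaf (hypothetical values).
TREE = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
FEATURES = {"a1": (3, 0), "a2": (1, 1), "b1": (0, 1), "b2": (5, 5)}

value = alphabeta("root", 2, float("-inf"), float("inf"), True,
                  lambda n: TREE.get(n, []), lambda n: FEATURES[n], (1.0, 2.0))
```

Changing the weight vector changes which leaf values are preferred and therefore which move the search selects, which is why the weights are a natural target for GA optimization.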

Genetic algorithms are comparatively robust against local minima and suboptimal
results. Consequently, GAs are well suited for searching the weight space of the
heuristic function h used in game-tree algorithms. In the design of GA learning
algorithms, we take into consideration
many important factors (e.g., those summarized in ref. [6]). To play Othello games, 
our coach programs and GA learning programs employ features such as position, 
piece advantage, mobility, and stability. We treat a chromosome string as a vector of 
real-valued parameters which represent the coefficients of the game-playing heuristic
function. 

Figure 21.2. GA training of Othello players.

Before beginning this research, we organized a local computer Othello tournament
with about 60 competitors. The top five players in the tournament were
selected as the five coach programs. A full-width minimax search with alpha-beta
pruning was commonly used in the coach programs. To deal with time constraints,
some of the programs employed the iterative deepening strategy (i.e., they performed
a full N-ply search before attempting an (N + 1)-ply search). We will call the coaches
Coach$_1$, Coach$_2$, Coach$_3$, Coach$_4$, and Coach$_5$, respectively, in order of increasing
power. In other words, Coach$_5$ was the program that won the tournament. The
rationale behind employing multiple coaches is twofold. First, it is difficult to choose
a single evaluation standard for success in a real-world, multicriteria environment 
such as game playing. A necessary condition for a player to be “good” is that he 
or she can beat more than one “good” rival. Second, if we use only one coach for 
training, it is easy for individuals to overcommit themselves to that coach’s weak 
points and thus fall into local optima. 

All the programs were coded in C++. The population size was set to 200.
Each member of the population took about 15 s on a 486/DX2-66 PC to finish a 
game. Evolutionary behavior was visible even with the relatively small population 
size. We employed fitness-proportional selection in this project. Each member of 
a certain generation played two games with one coach. Each pair of opponents 
took turns starting the two games. The fitness score was defined as the sum of 
the piece advantages of the two games. In terms of reproduction, we found that 
recombination, especially crossover, was very productive in yielding good offspring. 
Figure 21.2 summarizes the GA training process. 
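A small Python sketch of this fitness scheme follows. The text does not say how negative scores (from lost games) were handled under fitness-proportional selection, so the shift that makes every score positive is an assumption, as are the function names.

```python
import random

def fitness_from_games(piece_advantages):
    """Fitness = sum of the piece advantages over the two games played."""
    return sum(piece_advantages)

def selection_probabilities(scores):
    """Fitness-proportional probabilities after shifting scores to be positive."""
    low = min(scores)
    shifted = [s - low + 1.0 for s in scores]  # assumed handling of negative scores
    total = sum(shifted)
    return [v / total for v in shifted]

def select_parents(population, scores, k, rng=random):
    """Sample k parents; better scores are chosen more often."""
    return rng.choices(population, weights=selection_probabilities(scores), k=k)
```

For example, three members whose two-game results are (10, -4), (-20, -6), and (30, 14) receive fitness scores 6, -26, and 44, and the third is selected most often.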
Figure 21.3. Performance curves when learning against (a) Coach$_1$; (b) Coach$_5$.


21.4 SIMULATION RESULTS OF THE BASIC MODEL 

To test the validity of the proposed model, we applied it in several learning situa- 
tions. In our first try, we used a game-playing heuristic function with six features: 
board position measure, piece advantage, current mobility (defined as the differ- 
ence in the number of possible moves between a GA player and a coach), stability 
(number of unflippable pieces), potential mobility 1 (number of opponent’s pieces 
adjacent to empty squares), and potential mobility 2 (number of empty squares 
adjacent to opponent’s pieces). 

For comparison, we initially let the GA player learn from a single coach. Learning
curves demonstrating the evolutionary processes against Coach$_1$ and Coach$_5$ are
shown in Figure 21.3(a) and 21.3(b), respectively. As expected, the best individual
in each case learned to beat the corresponding coach because it successfully found a
good weight combination for its heuristic function. The GA player won by a larger
margin against Coach$_1$ than against Coach$_5$, since Coach$_5$ was the tougher of the
two coaches.

Now we come to an interesting question: If the GA player has an opportunity 
to learn from all five coaches, will the result be better than that in cases where only 
a single coach is involved, regardless of whether the coach is the champion or an 
ordinary player? This question is not trivial, because we know that learning with 
multiple goals at the same time may result in useless interpolation. The only way 
to answer this question is by doing experiments. 

Figure 21.4. Performance curves when learning against all five coaches: (a) basic
models; (b) with expanded position features.

In the initial population for training with multiple coaches, each individual was
scheduled to play two games with Coach$_1$. In any later generation, only those
individuals who had beaten Coach$_i$ and were not changed by the reproduction
operators were scheduled to play against Coach$_{i+1}$. The evolutionary learning results
are shown in Figure 21.4(a).
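The promotion rule can be written down as a short Python sketch. The text does not state which coach an individual faces after it has been changed by crossover or mutation, nor what happens after the last coach is beaten; sending changed individuals back to the first coach and keeping winners at the last coach are assumptions, as are the class and field names.

```python
from dataclasses import dataclass

@dataclass
class Individual:
    weights: tuple
    coach: int = 1            # index of the coach faced in this generation
    beat_coach: bool = False  # True if the two-game total was a win
    changed: bool = False     # True if crossover or mutation altered the weights

def next_coach(ind, n_coaches=5):
    """Coach that `ind` is scheduled to play in the next generation."""
    if ind.changed:
        return 1  # assumption: a modified individual starts over
    if ind.beat_coach:
        return min(ind.coach + 1, n_coaches)  # promoted, capped at the last coach
    return ind.coach
```

Under this rule only weight vectors that survive reproduction intact, and that have already proven themselves, meet the stronger coaches.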

Obviously, the learning effect was better than that with single-coached training. 
The best player in the final population against Coach$_1$ achieved a total score of
more than 100, which was better than the result shown in Figure 21.3(a). The most
remarkable result was that the best player against Coach$_5$ was able to beat the
coach by a margin of greater than 70. Note that the final score in Figure 21.3(b) 
was saturated below 50. Thus, keeping chromosomes with greater variety helped 
produce a performance breakthrough. From the learning curves we can observe 
that every time an individual developed a better weight combination, the result 
was likely to ripple out to affect its performance against other coaches. Moreover, 
we observe that the best players against different coaches were in general different
individuals; thus the ripple effect was caused by crossover.

Table 21.1. Play against Master. Two games were played. The table shows the
sum of the final scores of our program players against Reversi Master.

After we found that the GA produced satisfactory results, we tried to enhance
its potential ability to identify lower-level features. The current board position
measurement plays a pivotal role in Othello. In the previous design the importance
of each square was determined by human experience. To avoid this kind of
high-level human involvement in learning, we broke the configuration feature of the
whole board down into individual position features. In the expanded chromosome
encoding, we represented the board position measure as 10 values, taking advantage
of the symmetry of the Othello board. Furthermore, we found that a single feature
of relative mobility was not enough to reflect the importance of the total number
of legal moves in different game-playing stages, such as opening moves and closing
moves. Hence, we broke that feature into two subfeatures: the number of possible
moves for the player and the number of possible moves for the coach. Thus we had
a total of 17 features in the revised game-playing heuristic function.
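The figure of 10 position values is what one obtains when squares that map onto each other under the reflections and rotations of the 8 by 8 board share a single weight. The following Python sketch checks this count; the function name and the folding construction are ours, and the text does not say how the authors indexed the classes.

```python
def symmetry_classes(n=8):
    """Group the squares of an n-by-n board into symmetry-equivalent classes."""
    classes = {}
    for r in range(n):
        for c in range(n):
            # Fold the board into one quadrant (horizontal and vertical mirrors),
            # then fold that quadrant across its diagonal.
            rr, cc = min(r, n - 1 - r), min(c, n - 1 - c)
            key = (min(rr, cc), max(rr, cc))
            classes.setdefault(key, []).append((r, c))
    return classes
```

For the Othello board this yields 10 classes covering all 64 squares; for instance, the four corners form one class and therefore share one position weight.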

The learning curves with this expanded chromosome format are shown in Fig- 
ure 21.4(b). As expected, the learning speed was reduced because it was time- 
consuming to find the correct setting of importance for each position. However, 
interaction between coach curves occurred much more frequently than before. 

Another interesting point is that the order of the coaches in terms of power, 
from the GA player’s point of view, was not quite the same as the ranking found in 
the tournament. For example, the GA player beat Coach$_3$ by a larger margin than
Coach$_2$, whereas the former beat the latter in the tournament. Moreover, as pointed
out in ref. [16], an interesting relationship could develop between average human 
players (Hs), expertise-based game-playing programs (Es), and GA-based learning 
programs (Ls). This relationship would be cyclic, just like that between the scissors, 
paper, and stone in the well-known children’s game: Although Es outperform Hs, 
Ls learn to beat Es, and Hs are, in most cases, able to beat Ls easily. The apparent 
reason for this relationship is that Ls overcommit themselves to the weak points of 
Es and develop a narrow strategy that is not good for generalization. 

Since it is interesting to see whether training with multiple coaches enables Ls 
to avoid this drawback, we had the programs play against the commercial Othello 
program Reversi for Windows. There are four levels in Reversi: Beginner,
Novice, Expert, and Master. We chose the Master level for our program players,
including all the coaches and the best GA players against each coach. Since we
had two versions of GA encoding, we use the superscripts 6 and 17 to denote the
number of genes used in the chromosome format. For example, Best$_2^{17}$ denotes the
best player against Coach$_2$ using 17 features. We gave our programs 0.3 seconds
for each move; this was in general less than the time consumed by Reversi, which 
typically used 2 to 3 seconds for its midgame search. The results are summarized 
in Table 21.1. 

If we consider the Best$_i^{17}$'s, the cyclic relationship is clear. Best$_i^{17}$ beat Coach$_i$,
Coach$_i$ beat Master, and Master beat Best$_i^{17}$, with Best$_2^{17}$ the only exception.
However, if we consider the Best$_i^{6}$'s, which employed more human knowledge in terms
of position evaluation than the Best$_i^{17}$'s, the outcome was different. In this case,
we had three GA players among the Best$_i^{6}$'s who beat both their coaches
and the Reversi Master program. Although these results enhanced our belief in the 
learning ability of GAs, they also left us many questions to answer in the future. 

21.5 USING FUZZILY CHARACTERIZED FEATURES 

In an Othello game there are at most 60 moves. We know that different strategies 
should be employed in different game stages. However, in most cases, the stages 
can only be described linguistically. For example, “keeping high mobility is
important at the beginning” and “in the endgame stage one should capture as many
pieces as possible” are two obvious strategic rules in playing a game. Linguistically
characterized game features are suitable for representing strategies. In this section,
we will realize this concept directly using fuzzily characterized features. In the next
section, we will propose a general model based on fuzzy polyploid chromosomes. 

The general form of the static evaluation function h(t) is now as follows:

$$h(t) = \sum_{i=1}^{n} \mu_{f_i}(t)\, w_{f_i} f_i, \tag{21.2}$$

where $\mu_{f_i}(t)$ is the membership function characterizing feature $f_i$, with $t$ denoting
the ply number in a game.

We employed the expanded seventeen features in this set of experiments. Each 
feature has its own function to indicate its importance as a game proceeds. The 
membership functions characterizing these features are shown in Figure 21.5.

The 10 positional features are largely time invariant. In other words, the im- 
portance of a position in Othello does not change dramatically as a game proceeds. 
However, as the end of a game approaches, players’ freedom of choice concerning 
which position to take decreases steadily. Thus, the corresponding curve drops down 
near the end. The same observation applies to the two potential mobility measures. 
Because at the final stage of a game there are fewer and fewer empty squares left
on the board, the relative importance of potential mobility should decrease. 

In contrast, experience tells us that in an Othello game we should not be too 
greedy (i.e., we should not emphasize the piece advantage in the early stages of a 
game). As the game proceeds, we will gradually increase our focus on this feature 
to capture as many pieces as possible. 

On the other hand, because we also emphasize the number of possible moves 
that our opponent can make, with the aim of forcing him or her to pass in the final 
stage of the game, we treat mobility as an important feature no matter what stage 
of the game we are currently in. Similarly, because we do not want our pieces to be 
flipped at any time, we use a function with the unit value everywhere to indicate 
the importance of stability. 

Figure 21.5. Membership functions characterizing game-playing features; panel (f)
shows potential mobility 2.

The learning curves against the five coaches of the GAs employing fuzzified
features are shown in Figure 21.6. Compared to Figure 21.4(b), the overall performance
and the learning speed of the GAs with fuzzified features are clearly better than
those with pure GA training alone. This result should not be surprising because we
actually put more knowledge into the system than before.

In summary, employing fuzzified game-playing features allows the static evalu- 
ation function to emphasize different features at different stages, so that the eval- 
uation score of the board configuration is more appropriate than before. However, 
since the membership functions characterizing the features are determined by hu- 
man experts, there is still room for improvement in terms of automatic learning. 


21.6 USING POLYPLOID GA IN GAME PLAYING 

After discussing fuzzily characterized game features, we now introduce an even more
flexible approach by using fuzzily characterized game stages. By dividing a game
into several fuzzy stages, we actually consider the features in a dynamic way, which
is intuitive and likely to result in better performance.

Take a three-stage game as an example. In this case, we employ three membership
functions to characterize the three stages of the game: open game, midgame,
and end game, as shown in Figure 21.7. Since it takes 60 plies to finish an Othello
game, we use the thirtieth ply as the core of the midgame membership function.
Note that both open game and end game overlap with midgame, so a natural
transition is supported in this fuzzy scheme.

Figure 21.6. Performance curves: learning against five coaches, with fuzzified
features.

Figure 21.7. Three stages of an Othello game characterized by fuzzy membership
functions (vertical axis: membership; horizontal axis: ply).

Figure 21.8. Gene structures: (a) haploid chromosomes, (b) triploid chromosomes,
(c) structural expansion. Each dashed-line box represents a population. Each
solid-line box stands for a chromosome, which is a collection of game-playing
features.

Each stage needs a corresponding static evaluation function, which is represented
by a chromosome. Consequently, we have three chromosomes for each member in a
population, as shown in Figure 21.8(b). This chromosome coding scheme is called
triploidy, which is a special case of polyploidy. The overall value of the static
evaluation now is the weighted average, governed by the membership grades of the
current ply, of the individual evaluation functions represented by the individual
chromosomes. Formally, the static evaluation function h(t) can be written as

$$h_m(t) = \sum_{k=1}^{m} \mu_k(t) \sum_{i=1}^{n} w_{f_i(k)} \, f_i, \tag{21.3}$$

where k stands for a fuzzy stage (m stages in total) and $\mu_k(t)$ is the membership
value of ply t in stage k. Since we have a set of genes (features) for each stage in
this case, we use $f_i(k)$ to denote the ith feature in stage k. Compare the preceding
formula with Equations (21.1) and (21.2) to see the different focus of each method.
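
To make the stage-weighted evaluation concrete, here is a minimal sketch for a triploid member. The three stage membership functions are hypothetical piecewise-linear shapes centered so that midgame peaks at ply 30; the actual curves of Figure 21.7 are not reproduced here, and the weights and feature values are invented.

```python
from typing import List, Sequence


def stage_memberships(t: float) -> List[float]:
    """Hypothetical grades of ply t in (open game, midgame, end game)."""
    opening = max(0.0, min(1.0, (30.0 - t) / 30.0))  # 1 at ply 0, 0 from ply 30 on
    midgame = max(0.0, 1.0 - abs(t - 30.0) / 30.0)   # peaks at ply 30
    endgame = max(0.0, min(1.0, (t - 30.0) / 30.0))  # 0 until ply 30, 1 at ply 60
    return [opening, midgame, endgame]


def polyploid_evaluation(
    t: float,
    features: Sequence[float],
    stage_weights: Sequence[Sequence[float]],
) -> float:
    """Sum over stages k of mu_k(t) times the stage-k weighted feature sum."""
    grades = stage_memberships(t)
    return sum(
        mu * sum(w * f for w, f in zip(weights_k, features))
        for mu, weights_k in zip(grades, stage_weights)
    )


features = [1.0, 2.0]  # made-up feature values
stage_weights = [      # one chromosome (weight vector) per stage
    [1.0, 0.0],        # open game
    [0.5, 0.5],        # midgame
    [0.0, 1.0],        # end game
]

for ply in (15, 30, 45):
    print(ply, polyploid_evaluation(ply, features, stage_weights))
```

Because neighboring stages overlap, the score moves smoothly from the open-game chromosome to the end-game chromosome instead of switching abruptly.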

In nature, polyploidy is usually accompanied by a type of dominance mechanism 
so that at any time only one chromosome acts as the surrogate of the group. The 
chromosomes other than the dominant one are called recessive chromosomes. Our 
approach treats polyploid chromosomes in a different way. Their relationship is one 
of cooperation rather than dominance. In other words, each chromosome’s role is 
not one of all-or-nothing participation but rather one of being part of a weighted 
combination guided by membership functions. 

An interesting question to consider is whether polyploidy is an original form or is 
itself a result of evolution. In other words, in the context of our application, should 
we employ triploid chromosomes at the very beginning or somewhere further on in 
the process of evolution? To answer this question, in addition to the traditional 
haploidy in Figure 21.8(a) and the proposed triploidy in Figure 21.8(b), we also 
used a third scheme: structural expansion at a certain point during the evolutionary 
process. Figure 21.8(c) illustrates this concept. In practice, we chose to combine 
the haploid members in the twentieth generation to form triploid chromosomes. 
Since long chromosomes usually slow down learning considerably, we used delayed 
structure expansion to alleviate the effect of expansion on learning speed. Further, 
at the twentieth generation the members are largely elite according to the judgment 
of our five coach programs. Thus combinations of these individuals have a better 
chance of producing improved triploid members than the combinations produced 
by the random initialization procedures employed in simple triploid experiments. 
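
The delayed structural expansion can be sketched as follows. The text does not say how haploid members are combined, so this sketch simply assumes that each triploid member is assembled from three haploid chromosomes drawn at random from the twentieth-generation population; the function name and the population size are hypothetical.

```python
import random
from typing import List

Chromosome = List[float]  # one weight per game-playing feature
Member = List[Chromosome]  # a polyploid member: one chromosome per stage


def expand_to_triploid(haploids: List[Chromosome], rng: random.Random) -> List[Member]:
    """Build one triploid member per haploid by sampling three evolved chromosomes."""
    return [[list(rng.choice(haploids)) for _ in range(3)] for _ in haploids]


rng = random.Random(0)
# Stand-in for the haploid population at generation 20 (17 features per chromosome).
generation_20 = [[rng.uniform(-1.0, 1.0) for _ in range(17)] for _ in range(8)]
triploids = expand_to_triploid(generation_20, rng)
print(len(triploids), len(triploids[0]), len(triploids[0][0]))
```

Every chromosome in the expanded population is a copy of an already evolved haploid, which is the point of delaying the expansion rather than starting from random triploids.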

The learning curves against the five coaches of the GAs employing triploidy and 
structural expansion are shown in Figures 21.9(a) and 21.9(b), respectively. On av- 
erage, structural expansion produced better performance than simple triploidy, as 
expected. After the twentieth generation, the point of structural expansion, there 
are obvious breakthroughs on the curves in Figure 21.9(b). Moreover, the interac- 
tion between curves is also relatively frequent in the case of structural expansion, 
which can be considered an indicator of diversity. 

To compare the performance achieved with haploidy, triploidy, and structural
expansion in more detail, we summarize the scores of the fifty-sixth generation and
those of the one-hundred-fortieth generation for the three chromosome
coding schemes in Tables 21.2 and 21.3, respectively. Although the character
of the coaches affected the variance to a certain degree, from the data in these
tables we can still draw two important conclusions. First, simple triploidy did not
produce satisfactory improvement because randomly initialized longer chromosomes
had more difficulty evolving. Second, structural expansion yielded the best results
in terms of final performance and learning efficiency.

In our final experiment, we pushed the concept of structural expansion to its 
limit. This time we used a GA that started with haploid chromosomes, and then, af- 
ter every 10 generations, increased the length of each chromosome by one unit. The 
method we used was to duplicate one unit in each of the polyploid chromosomes. 
This concept is demonstrated in Figure 21.10. The increased stages were charac- 
terized by corresponding multistage membership functions as before. In an Othello 
game we cannot, of course, expand the chromosome structure endlessly because it 
is not reasonable to divide a game into too many stages. Thus, we stopped the 
expansion mechanism at the one-hundredth generation (i.e., 10 fuzzy game-playing
stages). The learning curves against the five coaches of the GAs employing this
type of structural expansion are shown in Figure 21.11. Obviously, this mechanism
produced better results than any of the triploid cases.

Figure 21.9. Performance curves when learning against five coaches: (a) with
triploid chromosomes; (b) with structurally expanded chromosomes.

Table 21.2. Comparison: fifty-sixth generation. Two games were played between
each player and each coach. The table shows the sum of the final scores of program
players against coaches. Three chromosome structures are compared in the table:
haploid, triploid, and structural expansion. The scores were taken from the
fifty-sixth generation of evolution in each case.

            Haploid   Triploid   Expansion
Coach 1       +50       +40         +40
Coach 2       +32       +46         +36
Coach 3       +52       +27         +78
Coach 4       +17        +9         +14
Coach 5       +13       +70         +29
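
The progressive expansion described above (one unit duplicated every 10 generations) can be sketched as follows. This is only an illustration under assumptions: the unit to duplicate is chosen at random, the copy is placed next to its source, and expansion simply stops once a member has 10 stages. None of these details are specified in the text.

```python
import random
from typing import List

Chromosome = List[float]
Member = List[Chromosome]

MAX_STAGES = 10  # assumed cap: 10 fuzzy game-playing stages


def duplicate_one_unit(member: Member, rng: random.Random) -> Member:
    """Return a copy of the member with one randomly chosen unit duplicated in place."""
    i = rng.randrange(len(member))
    return member[: i + 1] + [list(member[i])] + member[i + 1:]


rng = random.Random(1)
member: Member = [[0.1] * 17]  # starts haploid: a single chromosome of 17 weights

for generation in range(1, 141):
    # Selection, crossover, and mutation would happen here in a full GA.
    if generation % 10 == 0 and len(member) < MAX_STAGES:
        member = duplicate_one_unit(member, rng)

print(len(member))
```

Each new stage starts as an exact copy of an existing one, so the evaluation function is unchanged at the moment of expansion and can then specialize through further evolution.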

Table 21.3. Comparison: one-hundred-fortieth generation. The specifics are the
same as in Table 21.2. The scores were taken from the one-hundred-fortieth
generation of evolution in each case.

            Haploid   Triploid   Expansion
Coach 1       +68      +113         +70
Coach 2       +51       +46         +74
Coach 3       +69       +44         +78
Coach 4       +49       +15         +50
Coach 5       +38       +84         +84

21.7 SUMMARY

Maintaining diversity in GAs is an important issue worthy of further investigation.
In addition to the traditional approach of trying to construct a balanced criterion
between chromosome diversity and evolutionary convergence, we can also explore
the promising alternative of using multiple and dynamic standards of survival or
success. Game-playing provides a fertile ground for experiments in this area. In
this chapter we have focused primarily on two aspects of game playing: learning
behavior with multiple coaches and the effects on learning of using fuzzified features.

This chapter describes two methods of integrating fuzzy set theory and genetic 
algorithms. The first employs heuristically determined membership functions to 
indicate changes in the importance of features as a game proceeds. The second, 
even more flexible, approach explores the use of polyploidy, which encodes game 
stages in chromosome structures. This allows us to apply a different set of feature 
weights in different game stages. Furthermore, the stages are fuzzily characterized 
so that no abrupt jump occurs across stage boundaries. Both of the aforementioned 
mechanisms produced satisfactory results. 

The attributes and mechanisms of polyploidy remain important issues worthy 
of further study. Possible future work includes adaptation of membership functions 
so that a higher degree of flexibility in search of optima can be achieved. It is 
also possible to vary the degree of polyploidy to explore the relationship between 
complexity and performance. Another direction is to introduce Lamarckian prop- 
erties into the paradigm used here. After a fast learning algorithm is applied to 
individuals in a generation, the resulting weight vector could be encoded back into 
the chromosomes so that the adapted behavior of the parents could be passed on 
to their children.

Figure 21.10. Evolutionary gene structures. Each chromosome expands itself
by one unit after every 10 generations. One unit in the polyploid chromosome
duplicates itself in the structure. The transition from the 3rd stage to the 4th stage
demonstrates duplication.

Chapter 22


Soft Computing 
for Color Recipe Prediction 


E. Mizutani 


22.1 INTRODUCTION 

Colors give meaning and value to our daily lives; for example, painting a room 
the proper color can enliven it and make it more comfortable. We often need to 
specify to painters the favorite color coming from a pigment of our imagination. 
Using color recipe prediction as a practical application of soft computing in 
the paint industry, this chapter introduces readers to a neuro-fuzzy methodology we 
discussed in Chapter 13 and another computational intelligence approach that 
combines a knowledge base (KB) and three principal soft computing components: 
fuzzy systems (FSs), neural networks (NNs), and genetic algorithms (GAs). They 
function complementarily when put together; their synergism consequently presents 
unexpected performance enhancements. 

We shall demonstrate how the fusion of techniques surpasses the individual ca- 
pacity of any one technique: Color matching is an excellent test of these methods 
because it is difficult even for skilled human operators to do well, yet human color 
perception is sensitive, and therefore the matching must be done well to meet ac- 
ceptable standards. In the next section, we briefly introduce the color recipe predic- 
tion task. We then present a simple backpropagation multilayer perceptron (MLP) 
approach in Section 22.3. After that, we discuss a neuro-fuzzy approach in Sec- 
tion 22.4 and subsequently describe a genetic neuro-fuzzy approach in Section 22.5. 
Finally, this chapter concludes with a futuristic picture of soft computing; emerging
soft computing intelligence may revolutionize approaches to developing industrial
applications.

Figure 22.1. Input-output relation in a typical color recipe prediction system:
the measured reflectance of the target color (16 inputs) is mapped to colorant
proportions (10 outputs).


22.2 COLOR RECIPE PREDICTION 

Color recipe prediction relates the surface spectral reflectance of a target color to 
a list of several required colorant proportions that are needed to produce the same 
color as the reference color (see Figure 22.1). In a practical situation, it is necessary 
to examine the color match in daylight as well as in artificial light. This is an 
arduous task even for professional colorists. A succinct description of the main 
concerns in the recipe prediction is presented in Table 22.1. 

In our prediction task, we had 1446 training samples of Munsell color chips and 
302 checking samples of standard paint color chips from the Japan Paint Manufac- 
turers Association. Both data sets were sampled by surface spectral reflectance of 
target colors at 16 points in the visible range of the color spectrum between 400 nm 
and 700 nm in wavelength (20-nm intervals). With regard to (P2) in Table 22.1, 
the desired average number of colorants required to produce any color was about 4 
out of 10 colorants, as presented in Table 22.2. Those 10 types of colorants included 
three pairs of the same types of colorants (i.e., green, yellow, and red ones) and also 
complementary colorants such as “green and red” and “blue and yellow” (see the 
10 output units in Figure 22.1). (That is, we carefully determine which colorants 
to use, avoiding use of the same colorant types and complementary colorants at the 
same time.) All subsequent experiments were conducted using the same data sets. 


22.3 SINGLE MLP APPROACHES 

Since the conventionally used Kubelka-Munk theory requires certain assumptions 
that limit the situations in which the theory may be applied [14], a simple backprop- 
agation MLP approach has been introduced as an alternative method to overcome 
practical obstacles in color recipe prediction [2, 9, 11]. 

Two types of simple MLPs, NN_norm and NN_mod, have been applied as a
touchstone to the recipe prediction to fathom the intrinsic difficulty of the task [9];
NN_norm has normal sigmoidal functions and NN_mod has modified sigmoidal
functions in the output layer. Both NN_norm and NN_mod have the same model size
(16 x 18 x 21 x 10 neurons), mapping surface spectral reflectance of a target color
(16 sampled inputs) to a list of required colorant concentrations (10 outputs) (see
Figure 22.1).

Table 22.1. Main concerns in color recipe prediction.

(P1) It is difficult to predict precise colorant concentrations. We sometimes need
     to predict proportions with enough precision to specify levels such as 0.01%,
     which is the desired minimal colorant proportion level.

(P2) It is necessary to limit the number of colorants used in order to meet
     cost performance requirements. At the same time, in the choice of
     colorants, we need to avoid the use of complementary colorants and of the same
     types of colorants.

(P3) The magnitude of mean-squared error of colorant vectors may not correspond
     exactly to that of color difference. The question is which colorant has the most
     significant impact on the entire color. For instance, if the target color is very
     bright, we have to determine carefully the concentrations of dark-colored
     pigments.

(P4) It is important to consider human visual sensitivity to color difference, which
     is closely related to perceptual attributes of color (i.e., lightness, hue, and
     chroma [4, 14]).

(P5) Some different combinations of colorants may have the same perceptual
     attributes of color as seen by humans.

Table 22.2. Number of data classified by the desired number of colorants required
to produce color in data sets.

                      Two colorants   Three colorants   Four colorants   Desired average #
Data set              desired         desired           desired          of colorants to use
1446 training data    4               60                1,382            3.95
302 test data         0               13                289              3.96
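
As a quick arithmetic check on the data description, the short script below recomputes the 16 sampling wavelengths (400 nm to 700 nm in 20-nm steps) and the desired average number of colorants from the counts reported in Table 22.2. The variable and function names are illustrative only.

```python
from typing import Dict

# Sampling points of the reflectance curve: 400 nm to 700 nm in 20-nm steps.
wavelengths_nm = list(range(400, 701, 20))


def average_colorants(counts: Dict[int, int]) -> float:
    """Mean number of colorants, given {colorants per recipe: number of samples}."""
    total_samples = sum(counts.values())
    return sum(k * n for k, n in counts.items()) / total_samples


training_counts = {2: 4, 3: 60, 4: 1382}  # counts from Table 22.2
test_counts = {2: 0, 3: 13, 4: 289}

print(len(wavelengths_nm))                            # number of network inputs
print(round(average_colorants(training_counts), 2))   # training average
print(round(average_colorants(test_counts), 2))       # test average
```

The counts sum to 1446 and 302 samples, and the averages round to 3.95 and 3.96, matching the table.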

As indicated in (P2) of Table 22.1, we need to specify which colorants to use.
Table 22.2 shows the desired number of colorants in our data sets. The average
number of colorants required to produce any color is fewer than five; this means
that 6 of the 10 final outputs should be zero. In addition, we sometimes need
to predict proportions with enough precision to specify levels such as 0.01% [see
Table 22.1 (P1)]. It is an important concern in color recipe prediction to specify
such output range extremities [2].

Figure 22.2. Number of necessary colorants predicted by an MLP with the normal
sigmoidal functions (NN_norm), and an MLP with the modified sigmoidal functions
(NN_mod).

To handle these concerns, we have introduced modified sigmoidal functions and 
truncation filter functions in the output layer [8, 9] (refer to Section 13.3.3 for 
details). These functions prevent an NN from exceeding the desired output range. 
Thus the outputs are further processed to eliminate redundant colorants at the 
minimum of the desired output range. 

The effects of the modified sigmoidal functions can be seen more clearly in
Figure 22.2. NN_norm tends to specify use of more colorants than necessary; it
averages almost seven specified colorants, which is far from the ideal of about four.
On the other hand, Figure 22.2 shows that the number of colorants predicted by
NN_mod asymptotically approached the ideal number of colorants as iterations
progressed. The comparison of prediction accuracy between NN_norm and NN_mod
is detailed in Table 22.3; NN_mod was more effective than NN_norm in avoiding use
of the same types of colorants and of complementary colorants.

Although NN_mod did a better job than NN_norm, greater precision in
concentration specification is desired. This has inspired us to construct hybrid systems.

22.4 CANFIS MODELING FOR COLOR RECIPE PREDICTION 

In practice, we sometimes encounter severe standards which may be hard to meet by
employing a backpropagation MLP alone. We contend that another approach, such
as fuzzy modeling, must complement simple MLPs to enhance overall performance.
In this section, we show how neuro-fuzzy models can be generalized for
application to color recipe prediction; the neuro-fuzzy approaches are expressed within
the framework of CANFIS (Coactive Neuro-Fuzzy Inference Systems), detailed in
Chapter 13.

Table 22.3. Performance comparison of single MLP approaches, NN_norm and
NN_mod, using 302 checking data. NN_norm is a simple backpropagation MLP, and
NN_mod is an improved NN_norm. The column “Error” denotes the average colorant
error. The last five columns give the number of test data for which the model
output same-type or complementary colorants.

           Ave. # of   Error                                    Red &    Yellow
           colorants   x 10^-2   2 Green   2 Yellow   2 Red    Green    & Blue
NN_norm    6.66        2.616     97        154        125      198      129
NN_mod     3.90        2.031     14        7          5        0        1
Ideal      3.96        0         0         0          0        0        0

To find an ideal adaptive model for this task, we have investigated a variety 
of structures. They feature knowledge-embedded architectures and an adaptive 
FS, which serves to determine color selection. They have enormous potential for 
augmenting prediction capacity. 

22.4.1 Fuzzy Partitionings 

In fuzzy modeling, it is important to determine a reasonable number of membership 
functions (MFs) to maintain appropriate linguistic meanings. In the ANFIS simu- 
lation examples in Chapter 12, MFs were set up for all inputs using grid partitions, 
but this is questionable; the color recipe prediction problem has 16 surface spectral 
reflectance inputs and 10 colorant proportion outputs as depicted in Figure 22.1. 
When we pick 16 values $(X_1, \ldots, X_{16})$ from the surface spectral reflectance curve of
a given target color, can we specify any rule for each value? Or is it necessary to
establish MFs for each input value? If so, we would need the following 16 fuzzy rules:

Rule 1: If $X_1$ (at 400 nm) is $A_1$, then use a rule, $C_1$.

Rule 2: If $X_2$ (at 420 nm) is $A_2$, then use a rule, $C_2$.

...

Rule 16: If $X_{16}$ (at 700 nm) is $A_{16}$, then use a rule, $C_{16}$.

In these rules, $A_i$ is a fuzzy linguistic label. (Note that the visible color spectrum
is 400 nm to 700 nm.) These rules may not make sense since we do not have such
explicit knowledge per wavelength. Without explicit domain knowledge, adaptive
learning mechanisms enable ANFIS/CANFIS to build up fuzzy rules
automatically [5]. But if the initial MF setup has no meaning, it is futile to extract fuzzy
rules from a fuzzy logic point of view. Blindly applying fuzzy MFs to all scalar
inputs may turn out to be meaningless. Also, if there are a great many inputs, the
curse of dimensionality problem arises. The number of MFs should be carefully 
determined so that fuzzy rules can be held to meaningful limits. Fortunately, there 
is a formula for transforming the surface spectral reflectance of color to perceptual 
attributes, “lightness,” “hue,” and “chroma” [4, 14] (see also Section 22.5.5). These 
three values must be more suitable for treating color in a linguistically meaningful 
way than the 16 spectral values mentioned previously, and so we use them as our 
MF inputs. When we invert the 3-D partitions in the color attribute space to the 
16 dimensions of the spectral input space, certain complicated partitions must be 
constructed in the 16-D input space. In this way, we realize a complicated fuzzy 
partitioning. 
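
The curse of dimensionality mentioned above is easy to quantify for grid partitioning, where the number of fuzzy rules is the product of the numbers of MFs per input. The sketch below compares a grid over all 16 spectral inputs with a grid over the three perceptual attributes; the choice of two or three MFs per spectral input is only an illustrative assumption.

```python
from math import prod
from typing import Sequence


def grid_rule_count(mfs_per_input: Sequence[int]) -> int:
    """Number of rules in a grid partition: one rule per combination of MFs."""
    return prod(mfs_per_input)


print(grid_rule_count([2] * 16))    # two MFs on each of the 16 spectral inputs
print(grid_rule_count([3] * 16))    # three MFs on each of the 16 spectral inputs
print(grid_rule_count([3, 5, 3]))   # lightness, hue, chroma as the MF inputs
```

Even two MFs per wavelength already give 65,536 rules, whereas partitioning the three perceptual attributes keeps the rule base at a few dozen.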


22.4.2 CANFIS Architectures 

First, we consider just one perceptual attribute of color, hue, as a linguistic variable.
Using hue alone, we build up fuzzy MFs on the polar coordinates that define five
color regions: red, yellow, green, blue, and violet (see the membership value
generator in Figure 22.3). Specifically, fuzzy rules in the if-then format serve to determine
color selection. For instance,

Yellow rule: If the target color is “yellow,” then use a “yellow” rule, $C_y$.

Each color MF specifies the degree of membership of a color region and assigns the
degree value to each color rule (rule’s consequent) as the firing strength. In the
preceding yellow rule, the firing strength ($W_y$) is determined by the yellow MF.

Introducing more MFs may lead to better results. Hence we also consider a
case in which each color region has three MFs to express its three degrees of color.
For example, concerning the yellow region between the green and red areas, the
following rules apply:

Yellow rule 1: If the target color is “greenish yellow,”
then use a “greenish yellow” rule, $C_{gy}$,

Yellow rule 2: If the target color is “very yellow,”
then use a “very yellow” rule, $C_{vy}$,

Yellow rule 3: If the target color is “reddish yellow,”
then use a “reddish yellow” rule, $C_{ry}$.

In this case, we have 15 linguistic values (e.g., “greenish yellow”) on one linguistic
variable (hue) alone [see CANFIS (b) in Tables 22.4 and 22.6]. Introducing more
than 15 MFs on hue alone may reduce interpretability from a fuzzy logic
standpoint: the resulting fuzzy rules may be ill defined or hard to understand
simply because it is difficult to specify the difference that humans perceive between
greenish yellow and very yellow by saying “slightly greenish yellow” or using
some other vague description.

Figure 22.3. CANFIS with five color rules for color recipe prediction. The figure
includes the membership value generator.

Instead of increasing MFs, we can construct more sophisticated rule
consequents, such as neural rules (or local color expert NNs), as we have discussed in
Chapter 13. Figure 22.3 illustrates such a CANFIS with five color rules [which
correspond to CANFIS (d) and (e) in Tables 22.4 and 22.6]; one color MF is
positioned for one color region. This CANFIS model can be viewed as a variation of
the modular network we previously discussed in Sections 13.3.2 and 9.6. The given
prediction task is decomposed into five color rules or five local color experts, which
form rules’ consequents. In Figure 22.3, the “green rule” is expressed in a neural
rule with 16 spectral reflectance inputs. Each rule can be a linear rule, a sigmoidal
rule, or a neural rule.

So far, we have discussed how to build up a CANFIS with several MFs for
the hue aspect alone. Next, we take into account all three perceptual attributes
of color (lightness, hue, and chroma) to alleviate the problem (P4) in Table 22.1.
Specifically, in our experiments, we set up three MFs each for lightness and chroma,
and five color MFs for hue. Hence, we have CANFIS with 45 fuzzy
rules, as illustrated in Figure 22.4 [see CANFIS (c) in Tables 22.4 and 22.6].

Figure 22.4. CANFIS with 11 MFs (45 fuzzy rules) for color recipe prediction.
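
The way a five-rule CANFIS combines its local color experts can be sketched as below. Several details are assumptions made for illustration rather than the architecture of Figure 22.3: the hue centers are evenly spaced on the hue circle, the MFs are bell shaped over circular distance, the firing strengths are normalized to sum to one, and each rule is a stand-in function returning 10 colorant proportions.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical hue centers (degrees) for the five color regions.
HUE_CENTERS_DEG = {"red": 0.0, "yellow": 72.0, "green": 144.0, "blue": 216.0, "violet": 288.0}

Rule = Callable[[Sequence[float]], List[float]]


def circular_distance(a: float, b: float) -> float:
    """Smallest angular distance between two hue angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def bell(distance: float, a: float, b: float) -> float:
    """Bell-shaped membership grade as a function of distance from the center."""
    return 1.0 / (1.0 + (distance / a) ** (2.0 * b))


def firing_strengths(hue_deg: float, a: float = 36.0, b: float = 2.0) -> Dict[str, float]:
    """Normalized firing strength of each color rule for the given hue angle."""
    raw = {name: bell(circular_distance(hue_deg, c), a, b) for name, c in HUE_CENTERS_DEG.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def canfis_output(hue_deg: float, reflectance: Sequence[float], rules: Dict[str, Rule]) -> List[float]:
    """Weight each rule's 10 outputs by its firing strength and add them up."""
    strengths = firing_strengths(hue_deg)
    combined = [0.0] * 10
    for name, rule in rules.items():
        for j, value in enumerate(rule(reflectance)):
            combined[j] += strengths[name] * value
    return combined


def make_constant_rule(level: float) -> Rule:
    """Stand-in for a linear, sigmoidal, or neural rule consequent."""
    return lambda reflectance: [level] * 10


rules = {name: make_constant_rule(1.0) for name in HUE_CENTERS_DEG}
reflectance = [0.5] * 16  # made-up 16-point reflectance curve
strengths = firing_strengths(72.0)
output = canfis_output(72.0, reflectance, rules)
print(max(strengths, key=strengths.get))
```

In a trained system, each stand-in rule would be replaced by a local expert that maps the 16 reflectance values to colorant proportions.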

The CANFIS architectures discussed so far may have too many adjustable parameters.
To accelerate learning, we can employ the modified bell MFs defined in
Equation (22.1) to control the number of firing rules (i.e., local experts). This may
be useful because it may prove unnecessary to use more than two color rules at the
same time: For instance, when the target color is in a region between green and
yellow (that is, when the yellow rule and the green rule are fired), the neighboring red
rule and blue rule are not necessarily fired because of the yellow-blue and green-red
complementary color relationships. Several weight-updating procedures for
unnecessary or inactive rules can then be skipped when iterative training procedures are
employed.

The definitions of the modified bell MF and the original bell-shaped MF are 
presented next for the purpose of subsequent discussion: 

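$$\mu_{\text{mod}}(x) = \max\left\{ \frac{2}{1 + \left| \frac{x - c}{a} \right|^{2b}} - 1,\; 0 \right\}, \tag{22.1}$$

$$\mu_{\text{original}}(x) = \frac{1}{1 + \left| \frac{x - c}{a} \right|^{2b}}, \tag{22.2}$$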

where {a, b, c} is an adjustable parameter set. The modified bell MF is just 
the upper half part of the original bell-shaped MF and has a limited base width 
(support). 
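
The two MFs can be compared numerically with the sketch below. The original bell-shaped MF follows the standard generalized bell form; for the modified MF the sketch assumes the reading max{2 μ_original(x) - 1, 0}, which keeps only the upper half of the bell (rescaled to the range 0 to 1) and limits the support to |x - c| < a. The parameter values are arbitrary.

```python
def bell_mf(x: float, a: float, b: float, c: float) -> float:
    """Original bell-shaped MF: 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2.0 * b))


def modified_bell_mf(x: float, a: float, b: float, c: float) -> float:
    """Assumed modified bell MF: upper half of the bell, zero outside |x - c| < a."""
    return max(2.0 * bell_mf(x, a, b, c) - 1.0, 0.0)


a, b, c = 30.0, 2.0, 90.0  # arbitrary width, slope, and center
for x in (90.0, 105.0, 120.0, 150.0):
    print(x, bell_mf(x, a, b, c), modified_bell_mf(x, a, b, c))
```

Outside the limited support the modified MF is exactly zero, so the corresponding rule does not fire and its weight updates can be skipped, whereas the original bell MF never reaches zero.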

22.4.3 Knowledge-Embedded Structures 

Adaptive fuzzy MFs specify the degree of membership of five color regions (red, 
yellow, green, blue, violet) according to perceptual attributes of color. They de- 
termine what weight should be assigned to each rule’s output to produce a final 
output. We have applied the colorist’s judgment to the CANFIS architecture; sev- 
eral connections to the fuzzy association layer can be pruned. This idea is pictured 
in Figure 22.3; for instance, the green rule has no connection line to red units at 
the fuzzy association layer. That is, a green rule (weighted by a green MF) has 
no effect on red colorant proportions because of the green-red complementary color 
relationship. In this way, each neural color rule has fewer output units than the 10 
final CANFIS output units. The unit (or neuron) numbers are clearly presented in 
Table 22.5. 

As previously stated, the predicted number of colorants should be about four; 
this means that 6 of the 10 final outputs should be zero. Reducing the number of 
zero outputs through the pruning procedure can have a positive impact on the con- 
struction of the desired input-output mappings inside CANFIS. This modification 
is intended mainly to eliminate the problems of (P1) and (P2) in Table 22.1. 

22.4.4 CANFIS Simulation 

We explored CANFIS with linear or nonlinear rules with different MF setups. Ta- 
ble 22.4 shows five representative CANFIS descriptions in the simulation. In any 
CANFIS, we used the truncation filter function in the output layer. 

Although we implemented CANFIS (b) with 15 linear rules extensively, we did 
not obtain better results than CANFIS (a) with five sigmoidal rules. When the 
rule number was increased, difficulty in determining initial parameter setups was 
encountered. 

When we use CANFIS with five neural rules, as depicted in Figure 22.3, there
may be many possible optimal rule formations; Table 22.5 shows one of them, which
was found by a process of trial and error. [This CANFIS corresponds to CANFIS (e)
in Tables 22.4 and 22.6.] Because a different amount of training data goes into each
local neural color rule, each of the neural color rules can be optimized for its own
territory; that is, each can have a different model size. The initial data classified
into five color regions are shown in Table 22.5. Note that when the modified bell
MFs are used, color MFs metamorphose as learning progresses, and therefore the
amount of data in the five color categories changes. Alternatively, each color
expert could have the same model size, so we also tested
CANFIS with neural color rules that have the same model size [see CANFIS (d) in
Tables 22.4 and 22.6].

Table 22.4. Five representative CANFIS models for color recipe prediction.

(a) CANFIS with 5 sigmoidal rules, as shown in Figure 22.3,
    with pruned connections.
    5 bell-shaped MFs are set up for hue angle alone
    (i.e., 5 rules are for five color regions).

(b) CANFIS with 15 linear rules
    with no pruned connections.
    15 bell-shaped MFs are set up for hue angle alone.

(c) CANFIS with 45 rules, as shown in Figure 22.4,
    with no pruned connections.
    3 bell-shaped MFs are set up for lightness.
    3 bell-shaped MFs are set up for chroma.
    5 bell-shaped MFs are set up for hue angle.

(d) CANFIS with 5 neural rules, as shown in Figure 22.3,
    with no pruned connections.
    5 modified bell MFs are set up for hue angle alone.
    5 neural color rules have the same model size
    (i.e., each neural rule has 22 hidden units).

(e) CANFIS with 5 neural rules, as shown in Figure 22.3,
    with pruned connections.
    5 modified bell MFs are set up for hue angle alone.
    5 neural color rules are heuristically optimized independently.
    Those rules’ model sizes are specified in Table 22.5.

Table 22.4 describes the five representative CANFIS models.
Table 22.6 shows a performance comparison among those CANFIS models as well
as the MLP models discussed in Section 22.3.
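
The “Para. #” column of Table 22.6 can be reproduced for most models by simple counting. The sketch below assumes fully connected layers with one bias per unit and three adjustable parameters {a, b, c} per MF; under these assumptions the totals for the MLPs and for CANFIS (b) through (e) match the reported values. CANFIS (a) is left out because its pruned sigmoidal rule sizes are not listed in this section.

```python
from typing import List, Sequence

PARAMS_PER_MF = 3  # each bell MF has the adjustable set {a, b, c}


def mlp_params(layers: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


def canfis_params(rule_layers: List[Sequence[int]], n_mfs: int) -> int:
    """Parameters of all rule consequents plus the MF parameters."""
    return sum(mlp_params(layers) for layers in rule_layers) + PARAMS_PER_MF * n_mfs


single_mlp = mlp_params([16, 18, 21, 10])             # NN_norm and NN_mod
canfis_b = canfis_params([[16, 10]] * 15, n_mfs=15)   # 15 linear rules, 15 hue MFs
canfis_c = canfis_params([[16, 10]] * 45, n_mfs=11)   # 45 linear rules, 3 + 3 + 5 MFs
canfis_d = canfis_params([[16, 22, 10]] * 5, n_mfs=5)  # 5 neural rules, 22 hidden units
canfis_e = canfis_params(                              # rule sizes from Table 22.5
    [[16, 16, 16, 8], [16, 16, 17, 8], [16, 21, 7], [16, 15, 8], [16, 17, 6]],
    n_mfs=5,
)
print(single_mlp, canfis_b, canfis_c, canfis_d, canfis_e)
```

The counts show that the pruned model (e) attains the lowest error in Table 22.6 with fewer parameters than the unpruned model (d).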


22.5 COLOR PAINT MANUFACTURING INTELLIGENCE 

The discussed CANFIS served to build the color recipe prediction system shown in 
Figure 22.1. When we see the paint production environment around the prediction 
Table 22.5. Optimal structures of the five color neural rules in CANFIS (e) in
Tables 22.4 and 22.6, and their initial number of training/test data. The
structures were heuristically optimized. Each neural rule has fewer output
units than the 10 output units of the final CANFIS.

Five color      Model size          Training   Checking
neural rules                        data       data
-------------------------------------------------------
RuleRed         16 x 16 x 16 x 8    650        138
RuleYellow      16 x 16 x 17 x 8    707        200
RuleGreen       16 x 21 x 7         521        105
RuleBlue        16 x 15 x 8         363         65
RuleViolet      16 x 17 x 6         409         48

Table 22.6. Performance comparison between single MLP models (NNnorm and
NNmod) and five representative CANFIS models. Table 22.4 details the five
CANFIS models. NNnorm was a simple MLP approach, and NNmod was an improved
NNnorm. The column "Error" denotes the average checking error. The column
"Para. #" denotes the total number of modifiable parameters.

Model     # of membership functions    Rule   Rule        Error      Specified #   Para. #
          hue   lightness   chroma     no.    formation   (x10^-2)   of pigments
------------------------------------------------------------------------------------------
(a)        5        0          0         5    sigmoidal     7.99        3.46          852
(b)       15        0          0        15    linear       12.90        2.78        2,595
(c)        5        3          3        45    linear        2.59        3.76        7,683
(d)        5        0          0         5    neural        1.90        3.85        3,035
(e)        5        0          0         5    neural        1.41        4.00        2,691
NNnorm     -        -          -         1    neural        2.62        6.66          925
NNmod      -        -          -         1    neural        2.03        3.90          925

system, we notice the color paint manufacturing cycle illustrated in Figure 22.5. 
Basically, the main focus in recipe prediction should be color difference rather than 
colorant errors. Practically, the color difference defined in Equation (22.3) between 
pairs of presented colors should be smaller than about 1.0; human eyes cannot 
distinguish between smaller color differences. This bird’s-eye view of the manu- 
facturing cycle gives us a hint about how to feed back information about color 
difference to improve prediction accuracy. As summarized in Table 22.1, there are 
five major concerns in color recipe prediction. It is important to consider perceived 
color difference during the prediction process; we use an MLP, NNLab, to cope
with the third critical concern (P4) in Table 22.1.

This section presents a cooperative hybrid system to simulate such an entire 
Figure 22.5. Color paint manufacturing process.


manufacturing process in an attempt to construct manufacturing intelligence
(see footnote 1) for the color paint industry; we integrate the three major
elements of soft computing and problem-specific knowledge. That is, NNs, an FS,
and a GA with a KB complement each other in obtaining more precise outputs for
color recipe prediction through manufacturing simulation based on the entire
decision-making process of a professional colorist. Here, the GA plays a
leading role in this fusion system by evolving colorant proportion vectors.
Because (P2) in Table 22.1 is a kind of combinatorial problem and an
evolutionary framework is necessary, the GA may be a good choice for the
leading role at this stage. We shall clarify the evolutionary framework in
subsequent sections.

22.5.1 Manufacturing Intelligence Architecture 

In the initial stage, the first-generation population or starting points for a GA search 
are set by a fuzzy population generator and a multi-elite generator using results 
from the CANFIS and NN approaches. Those results must already be somewhat 
close to the range of ideal colorant concentrations. In the evolutionary phase, the 
fusion system tries to improve those encoded proportion members in conjunction 
with NNs and a KB; that is, two different NNs and a KB are used to make up 
the fitness function. Genes’ colorant concentrations are passed to three functions, 
which calculate fitness values individually. The three values are combined into the 

Footnote 1: Here, the term manufacturing intelligence means soft computing (or
computational) intelligence for simulating a paint manufacturing process. In
contrast, Wright and Bourne discussed manufacturing intelligence in a broader
sense, as the science of creating intelligent systems for manufacturing
applications involving hardware systems [13].
Figure 22.6. Architecture of color paint manufacturing intelligence.


final fitness value. In the following subsections, we shed light on more details of 
this evolutionary mechanism, illustrated in Figure 22.6. 


22.5.2 Knowledge Base 

Knowledge may be useful in reinforcing some favorable aspects of genetic searches [7].
Performing the color recipe prediction task requires special knowledge. We believe 
that a KB plays an important role in helping the system evolve to recognize specific 
features of a target color. The KB has the following main rules: 

Rule 1: Keep total proportions of colorants around 100%, 

Rule 2: Keep the number of necessary colorants around the ideal number, 

Rule 3: Avoid use of complementary colorants: e.g., Red and Green, 

Rule 4: Avoid use of the same type of colorants at the same time
(e.g., Red1 and Red2).

Note that we have 10 colorants (10 outputs) that include three pairs of the same
kind of colorants: green, yellow, and red ones (see Figure 22.1); each pair, such as
Red1 and Red2, has different characteristics. In this prediction task, the 100% rule
(rule 1) was emphasized, as in ref. [11].
22.5.3 Multi-elites Generator

CANFIS and NN approach results are encoded into the initial population as elite
members. Then a multi-elite generator produces more elites by modifying those
results according to rule 4 in the KB. That is, the concentrations of the same
type of colorants are summed into one or the other of them (e.g., Red1 + Red2 =>
Red1, or Red1 + Red2 => Red2). This is motivated by the fact that the simple
backpropagation MLP, NNnorm, tends to specify use of more than six colorants
(see Table 22.3), although the desired number of colorants to produce any color
in our data sets is fewer than five. Again, it is important to keep the number
of colorants used at a practical level. Multiple elite colorant vectors offer several
different starting points for GA searches. The number of encoded elites depends
on the quality of the CANFIS/NN results; we take the results of three approaches
(NNnorm, NNmod, and CANFIS), and so we have at least three elite members at
the initial stage. The combination of several solutions may be effective in finding
the optimal solution [6]. The other members are initialized by a fuzzy population
generator. This seeding procedure is shown in Figure 22.6 (left).
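
The merging step of the multi-elite generator can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the dictionary representation of a colorant vector, the function names, and the choice to merge one same-type pair at a time are assumptions made here for clarity.

```python
# Illustrative sketch of the multi-elite generator (rule 4 of the KB).
# Assumption: a colorant vector is a dict mapping colorant name -> proportion.
SAME_TYPE_PAIRS = [("Red1", "Red2"), ("Yellow1", "Yellow2"), ("Green1", "Green2")]


def merge_pair(vector, keep, drop):
    """Return a copy of `vector` with `drop`'s proportion summed into `keep`."""
    merged = dict(vector)
    merged[keep] = merged.get(keep, 0.0) + merged.get(drop, 0.0)
    merged[drop] = 0.0
    return merged


def multi_elites(vector):
    """Return the original elite plus merged variants for each pair in use.

    Assumption: only one same-type pair is merged per generated elite.
    """
    elites = [dict(vector)]
    for first, second in SAME_TYPE_PAIRS:
        if vector.get(first, 0.0) > 0.0 and vector.get(second, 0.0) > 0.0:
            elites.append(merge_pair(vector, first, second))  # e.g. Red1 + Red2 => Red1
            elites.append(merge_pair(vector, second, first))  # e.g. Red1 + Red2 => Red2
    return elites
```

For a result that uses both Red1 and Red2, this yields three elites: the original vector and two variants that each use a single red colorant, while the total proportion is unchanged.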

22.5.4 Fuzzy Population Generator 

The idea is to generate the initial population according to the fuzzy classification 
of a target color, which serves to determine color selection. First, we classify the 
target color into one of five color categories (red, yellow, green, blue, and violet) on 
the a*-b* plane, which shows hue and chroma [4, 14] (see also Section 22.5.5), and 
decide to what extent the desired color belongs to each color category using fuzzy 
MFs, as discussed in Section 22.4.1. We then generate initial color chromosomes 
by modifying chromosomes generated by a random number generator according 
to rules in the KB. For example, when a target color looks greenish yellow, green
chromosomes and yellow ones are generated; green chromosomes have zero values in
either the Green1 or the Green2 colorant concentration and in the red colorant
concentrations because of the red-green complementary color relationship (see
rule 3 and rule 4 in Section 22.5.2). Inactivating genes that carry information
on same-type colorants and complementary colorants is an effective way to
eliminate redundant colorants at the initial stage.

The number of green chromosomes ($\mathit{Num}_{Green}$) and that of yellow ones
($\mathit{Num}_{Yellow}$) are decided according to the following calculations:

\[
\begin{aligned}
\mathit{Pop}_{rest} &= \mathit{Pop}_{total} - \mathit{Pop}_{NN}, \\
\mathit{Num}_{Green} &= \frac{M_g \, \mathit{Pop}_{rest}}{M_y + M_g}, \\
\mathit{Num}_{Yellow} &= \frac{M_y \, \mathit{Pop}_{rest}}{M_y + M_g},
\end{aligned}
\]

where two membership values, $M_y$ and $M_g$, signify to what extent the target color
belongs to the yellow category and the green one, respectively. $\mathit{Pop}_{total}$ denotes

Figure 22.7. A component of the fitness function based on NNpig.

the total population number, and $\mathit{Pop}_{NN}$ signifies the number of elite chromosomes
from the CANFIS/NN results, including the chromosomes generated by the multi-
elite generator.
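
The allocation above can be sketched in a few lines. The text does not say how fractional counts are handled, so rounding the green count and giving the remainder to yellow is an assumption of this sketch, as are the function and argument names.

```python
def allocate_population(pop_total, pop_nn, m_yellow, m_green):
    """Split the non-elite population between green and yellow chromosomes.

    pop_total: total population size; pop_nn: number of elite chromosomes.
    m_yellow, m_green: membership values of the target color.
    Assumption: the green count is rounded and yellow takes the remainder,
    so the two counts always add up to pop_total - pop_nn.
    """
    if m_yellow + m_green <= 0.0:
        raise ValueError("at least one membership value must be positive")
    pop_rest = pop_total - pop_nn
    num_green = round(m_green * pop_rest / (m_yellow + m_green))
    num_yellow = pop_rest - num_green
    return num_green, num_yellow
```

For instance, with a population of 80, 8 elite members (a hypothetical count), and membership values of 0.75 for yellow and 0.25 for green, the remaining 72 chromosomes are split into 18 green and 54 yellow ones.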


22.5.5 Fitness Function 

The fitness function consists of three functions: two neural fitness functions (func- 
tion 1 and function 3) and the KB-based fitness function (function 2). 


Function 1 

Using NNpig, the first function evaluates genes' colorant concentration vectors
according to the specified use of colorants. The NNpig (16 x 18 x 21 x 10 neurons)
maps surface spectral reflectance to a list of required colorants (see Figure 22.7).
It gives just ON/OFF values to each output unit to predict which colorants should
be used to produce the same color as the target color, where ON means "colorant
needed" and OFF means "not needed." Function 1 evaluates each chromosome by
calculating a distance in binary space (ON/OFF) after each chromosome's
representation has been transformed into the ON/OFF format. Figure 22.7 describes
this procedure. Table 22.7 shows the capability of this trained NNpig.
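
A minimal sketch of function 1 is given below. The chapter states only that a distance in binary (ON/OFF) space is computed; using the Hamming distance, thresholding at zero concentration, and scaling the result to [0, 1] are assumptions made here for illustration.

```python
def to_on_off(concentrations, threshold=0.0):
    """Convert colorant concentrations to ON (1) / OFF (0) flags."""
    return [1 if c > threshold else 0 for c in concentrations]


def hamming_distance(a, b):
    """Count the positions at which two ON/OFF lists differ."""
    if len(a) != len(b):
        raise ValueError("ON/OFF lists must have the same length")
    return sum(x != y for x, y in zip(a, b))


def fitness_1(concentrations, nn_pig_on_off):
    """Assumed scaling: 1.0 when the colorant choice fully agrees with the
    ON/OFF output of NNpig, 0.0 when every colorant disagrees."""
    distance = hamming_distance(to_on_off(concentrations), nn_pig_on_off)
    return 1.0 - distance / len(nn_pig_on_off)
```

With 10 colorants, a chromosome that disagrees with the NNpig prediction on two colorants would score about 0.8 under this scaling.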


Function 2 

The second function calculates a fitness value based on the KB described in Sec- 
tion 22.5.2. The fitness value depends on the extent to which genes’ colorant con- 
centration vector obeys the rules in the KB. To keep the GA search moving in a 
consistent direction, the KB is used in both the initial stage and in the calculation 
of fitness values, as illustrated in Figure 22.6. 
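
The chapter does not give the formula used by function 2, so the following is only a hypothetical penalty-based sketch of how the four KB rules of Section 22.5.2 could be scored. The weights, the `ideal_count` parameter, the dictionary representation, and the mapping from penalty to fitness are all assumptions.

```python
# Hypothetical scoring of the four KB rules; not the authors' formula.
SAME_TYPE_PAIRS = [("Red1", "Red2"), ("Yellow1", "Yellow2"), ("Green1", "Green2")]
COMPLEMENTARY_TYPES = [("Red", "Green")]


def type_used(vector, type_name):
    """True if any colorant whose name starts with `type_name` is in use."""
    return any(c > 0.0 for name, c in vector.items() if name.startswith(type_name))


def kb_penalty(vector, ideal_count, total_weight=10.0):
    """Sum rule violations; rule 1 gets a larger (assumed) weight."""
    total = sum(vector.values())
    n_used = sum(1 for c in vector.values() if c > 0.0)
    penalty = total_weight * abs(total - 1.0)        # rule 1: total around 100%
    penalty += abs(n_used - ideal_count)             # rule 2: number of colorants
    penalty += sum(1 for a, b in COMPLEMENTARY_TYPES  # rule 3: complementary colorants
                   if type_used(vector, a) and type_used(vector, b))
    penalty += sum(1 for a, b in SAME_TYPE_PAIRS      # rule 4: same-type colorants
                   if vector.get(a, 0.0) > 0.0 and vector.get(b, 0.0) > 0.0)
    return penalty


def fitness_2(vector, ideal_count):
    """Map the penalty to (0, 1]; a vector obeying every rule scores 1.0."""
    return 1.0 / (1.0 + kb_penalty(vector, ideal_count))
```

Under these assumptions, a vector that mixes Red1, Red2, and Green1 is penalized by both rule 3 and rule 4 and therefore receives a lower fitness than a vector that obeys all rules.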
Function 3

The third function, based on NNLab, generates a fitness value with respect to
color difference between a target color and each member's color, whose colorant
concentrations are predicted by the system. Because it is time-consuming to
manufacture an actual color by mixing genes' specified values (see Figure 22.5),
NNLab plays a crucial role as a color simulator to predict what color will be
produced. The NNLab (10 x 11 x 14 x 3 neurons) maps colorant concentrations to
L*, a*, and b*; that is, by plugging each member's colorant proportions into
NNLab, we can obtain L*, a*, and b* to calculate the color difference between a
target color and an individual color (see Figure 22.8).

We adopted the CIE 1976 (L*, a*, b*)-space [4, 14]. This defines the color
difference and the perceptual attributes of color (lightness, hue, and chroma)
as detailed here:

\[
\begin{aligned}
\text{Color difference} &= \sqrt{(L^*_t - L^*)^2 + (a^*_t - a^*)^2 + (b^*_t - b^*)^2}, \\
\text{Lightness} &= L^*, \\
\text{Hue} &= \arctan(b^*/a^*), \\
\text{Chroma} &= \sqrt{(a^*)^2 + (b^*)^2},
\end{aligned}
\qquad (22.3)
\]

where L*, a*, and b* are calculated according to surface spectral reflectance and
(L*_t, a*_t, b*_t) are the values of a target color. Note that any color can be uniquely
identified by its surface spectral reflectance curve (i.e., its physical color attribute).

The calculated color difference shows how satisfactorily the predicted color
matches the reference color. The use of NNLab provides a way to take into account
human visual sensitivity to color difference. Table 22.8 shows the potential of the
color simulator NNLab.

Function 3 determines the fitness value ($\mathit{fitness}_3$) of each chromosome
according to the calculated color difference, $E$:

\[
\mathit{fitness}_3 = \exp(-E). \qquad (22.4)
\]
Table 22.7. Capabilities of different NN approaches in specifying necessary
colorants. NNnorm is a simple backpropagation MLP, and NNmod is an improved
NNnorm, as discussed in Section 22.3; CANFIS is a neuro-fuzzy model described
in Section 22.4. NNpig is a special NN that predicts necessary colorants, as
shown in Figure 22.7.

                              NNnorm   NNmod   CANFIS   NNpig
-------------------------------------------------------------
# of unmatched patterns         299      74      73       27
in 302 test patterns
# of unmatched units            911     106      98       48
in 3020 output units
Predicted ave. # of            6.66    3.90    3.89     3.96
required colorants


Table 22.8. Average color difference predicted by NNLab using ideal colorant
concentrations and the results of three NN approaches: CANFIS, NNmod, and
NNnorm. CANFIS is a neuro-fuzzy model described in Section 22.4; NNnorm is a
simple backpropagation MLP; and NNmod is an improved NNnorm, as discussed in
Section 22.3. This table shows the potential capability of NNLab for 302
checking data.

Colorant vectors    Ideal    CANFIS   NNmod    NNnorm
-----------------------------------------------------
Color difference    0.567    1.976    2.847    5.921


This function was designed so that fitness increases as the magnitude of the
color difference decreases: it decays exponentially with E and reaches its
maximum of 1 when E = 0.
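
Equations (22.3) and (22.4) translate directly into code. In the sketch below, the function names are assumptions, and `hue` uses the quadrant-aware `atan2(b*, a*)` variant rather than a plain arctangent of the ratio, which avoids a division by zero when a* = 0. In the actual system, the (L*, a*, b*) values of a chromosome would come from the trained color simulator NNLab, which is not reproduced here.

```python
import math


def color_difference(lab, lab_target):
    """Euclidean distance in CIE 1976 (L*, a*, b*) space, as in Equation (22.3)."""
    return math.sqrt(sum((t - x) ** 2 for x, t in zip(lab, lab_target)))


def hue(a_star, b_star):
    """Hue angle in radians; atan2 is a quadrant-aware form of arctan(b*/a*)."""
    return math.atan2(b_star, a_star)


def chroma(a_star, b_star):
    """Chroma: distance from the neutral axis on the a*-b* plane."""
    return math.hypot(a_star, b_star)


def fitness_3(lab, lab_target):
    """Equation (22.4): fitness decays exponentially with color difference."""
    return math.exp(-color_difference(lab, lab_target))
```

A color difference of 1.0, the approximate threshold of human discrimination cited in the text, corresponds to a fitness of exp(-1), or about 0.37.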


22.5.6 Genetic Strategies 

Genetic operations have a significant impact on the quality of solutions. We have 
embodied some ideas in both crossover operations and mutation operations. 
Figure 22.9. GA search control by the modified simplex crossover.

Modified Simplex Crossover 

We have modified the selection scheme in performing simplex crossover operations
(see footnote 2). Our modified selection uses the following three procedures:

1. Select one good chromosome with respect to fitness value. 

2. Pick, with high probability, an elite member (i.e., one of the mutant copies 
from the initial CANFIS/NN results) as a good chromosome. 

3. Choose one bad chromosome with respect to fitness value. 

The procedures share an idea of the downhill simplex method [10], based on a 
reflection away from a bad chromosome. This method may provide a better GA 
search direction, as illustrated in Figure 22.9. When we have a neural fitness func- 
tion, we may have a problem; a trained NN may not be a perfect fitness function. 
Indeed, both NNpig and NNLab in the fitness function are not perfect, as shown
in Tables 22.7 and 22.8, but they may be able to direct a blind GA search to a
better region of the search space. Procedure 1 points in a direction toward
minimizing color difference. In accordance with NNLab, a chromosome with higher
fitness may

Footnote 2: The simplex crossover proposed by Bersini and Seront [1] consists of
the following selection and bit-assignment steps to produce a new child
chromosome Cnew:

• Selection

Choose three chromosomes at random, and arrange them in decreasing order of
their fitness values. (Name them C1, C2, and C3 in that order.)

• Bit-assignment

If the ith bit of C1 is equal to the ith bit of C2, it is assigned to the ith bit of
Cnew. Otherwise, the inverse of the ith bit of C3 is assigned to the ith bit of
Cnew.
Figure 22.10. Exchanging mutation.


have smaller color difference. In this GA search, it is desirable to find a direction 
that minimizes both color difference and colorant errors. The problem is that we 
cannot calculate colorant errors directly. Yet the CANFIS/NN results provide a 
clue about better colorant concentrations since they must already be within some 
range of the ideal colorant concentrations. That is why mutant copies from the 
CANFIS/NN results, including ones originally generated by the multi-elite gener- 
ator, should be involved in guiding the search toward better colorant proportion 
vectors, as in procedure 2. 
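
The bit-assignment rule of the simplex crossover (footnote 2) and the three selection procedures can be sketched as follows. The bit-assignment follows the rule as stated; the selection details are assumptions of this sketch, in particular drawing "good" chromosomes from the fitter half, "bad" ones from the other half, and the hypothetical `elite_prob` value.

```python
import random


def simplex_child(c1, c2, c3):
    """Bit-assignment: keep bits on which the two good parents (c1, c2) agree;
    otherwise take the inverse of the bad parent's (c3) bit."""
    return [b1 if b1 == b2 else 1 - b3 for b1, b2, b3 in zip(c1, c2, c3)]


def modified_selection(population, fitness, elites, elite_prob=0.8, rng=random):
    """Pick (good, second good, bad) parents following procedures 1 to 3.

    Assumptions: 'good' means the fitter half of the population, 'bad' the
    other half, and elite_prob is a hypothetical value for 'high probability'.
    """
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    half = max(1, len(ranked) // 2)
    good = population[rng.choice(ranked[:half])]          # procedure 1
    if elites and rng.random() < elite_prob:              # procedure 2
        second = rng.choice(elites)
    else:
        second = population[rng.choice(ranked[:half])]
    bad = population[rng.choice(ranked[-half:])]          # procedure 3
    return good, second, bad
```

A child is then produced with `simplex_child(*modified_selection(population, fitness, elites))`; where the two good parents disagree, the child is reflected away from the bad chromosome, in the spirit of the downhill simplex method.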

Mutation Strategy 

The usual mutation operation, as in a simple GA [3, 12], is applied to all
members with a variable mutation rate scheme: a fixed mutation rate (0.01) is
adopted with a probability of 0.4; otherwise, a mutation rate ranging from 0.09
to 0.69 is chosen using a random number. Moreover, the following modified
operations are also considered:

• Chromosome template. To avoid specifying the use of more colorants than 
necessary, we set out to inactivate some genes using the fuzzy population 
generator, as described in Section 22.5.4. This has made it possible to use a 
chromosome itself as a template to do the mutation operation. Namely, before 
the mutation operation, it is decided whether to mutate an inactivated gene or 
not; the mutation is applied with low probability (0.1) to inactivated genes, 
which have zero values of concentrations after decoding the genes’ binary 
representations into colorant concentrations. If the mutation is applied to an 
inactivated gene, this leads to an increase in the number of necessary colorants. 

• Local search and preservation of multi-elites. Multi-elites (i.e., chromosomes
from the results of CANFIS/NN approaches) are mutated only at the lower
bits of each gene to keep traits similar to the NN results; those mutant copies
Table 22.9. Results of computational prediction in colorant error (x10^-2) and
the corresponding color difference predicted by NNLab, using 111 checking data.
NNnorm is a simple backpropagation MLP, and NNmod is an improved NNnorm,
as discussed in Section 22.3; CANFIS is a neuro-fuzzy model described in
Section 22.4; GNF is a genetic neuro-fuzzy model.

                      Ideal    NNnorm   NNmod    CANFIS   GNFall
----------------------------------------------------------------
Colorant error        0        2.312    1.543    1.139    0.643
(x10^-2)
Color difference      0.588    6.661    3.165    2.019    0.267
predicted by NNLab

of the multi-elites may stay in the vicinity of the original multi-elites. In
this way, local search of the NN results is realized. In addition, the offspring
of multi-elites always advance to the next generation; the mutant copies of
multi-elites are preserved throughout the entire evolution. Note that this
manipulation of low-order bits is applied only to multi-elites.

• Exchanging mutation. After the usual mutation, with low probability, members
are subjected to another mutation: exchanging genes that have the same
type of colorant information. This mutation is illustrated in Figure 22.10.
Among 10 output colorant proportions, we have three pairs of the same types
of colorants, such as Red1 and Red2, which have different natures; we must
decide which one to use. This exchanging mutation helps us to explore such
colorant choices. This may lead to an escape from local optima in the initial
CANFIS/NN and NNpig results; their choices may not match the final choice
determined by the system. The agreement with NNpig in Table 22.10 shows
how much the predicted choice of colorants optimized by the system matched
the colorant choices specified by NNpig.
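
The mutation strategy above can be sketched as follows. The rates 0.01, 0.4, 0.09 to 0.69, and 0.1 come from the text; everything else is an assumption of this sketch, including the representation of a chromosome as a list of bit-list genes, the convention that an all-zero gene decodes to zero concentration, and the caller-supplied probability for the exchanging mutation (the text says only "low probability").

```python
import random


def pick_mutation_rate(rng=random):
    """Fixed rate 0.01 with probability 0.4; otherwise a random rate in [0.09, 0.69]."""
    if rng.random() < 0.4:
        return 0.01
    return rng.uniform(0.09, 0.69)


def mutate_gene(bits, rate, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def template_mutation(chromosome, rate, inactive_prob=0.1, rng=random):
    """Use the chromosome as its own template: an inactivated gene (assumed
    here to be all zeros) is mutated only with probability `inactive_prob`."""
    mutated = []
    for gene in chromosome:
        inactive = not any(gene)
        if inactive and rng.random() >= inactive_prob:
            mutated.append(list(gene))          # leave the inactivated gene alone
        else:
            mutated.append(mutate_gene(gene, rate, rng))
    return mutated


def exchanging_mutation(chromosome, same_type_pairs, prob, rng=random):
    """Swap the genes of same-type colorants (e.g., Red1 and Red2) with
    probability `prob`; `same_type_pairs` holds pairs of gene indices."""
    swapped = [list(gene) for gene in chromosome]
    for i, j in same_type_pairs:
        if rng.random() < prob:
            swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped
```

The local search on multi-elites could reuse `mutate_gene` on only the low-order bits of each gene; how many bits count as "lower" is not stated in the text, so that step is left out of the sketch.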

22.6 EXPERIMENTAL EVALUATION 

To evaluate the capability of the hybrid system, color paint manufacturing
intelligence, we used 111 randomly selected checking data. The configuration of the
GA was as follows:

Population size           80 members
Mutation rate             flexible (see Section 22.5.6)
Crossover method          simplex crossover [1]
Simplex crossover rate    0.85
Maximum generations       10,000
Table 22.10. Performance evaluation in computational prediction of 111 checking
data. GNFall shows the maximal ability of manufacturing intelligence in
the prediction task. Parenthesized values denote potential capabilities with respect
to colorant errors; they were obtained when colorant errors were minimized. The
columns, from left to right, indicate "components of fitness functions" (o: used;
x: not used), "average number of generations when the solution chromosome
appeared," "colorant error (x10^-2)," "average number of colorants," "accord with
NNpig," and "color difference predicted by the color simulator," respectively.
Note that the column "Accord with NNpig" shows how much the predicted choice
of colorants optimized by the GNF models matches the colorant choice specified
by NNpig.

Model     Fitness functions    Ave. # of gen.     Colorant error      Ave. # of col.   Accord w/ NNpig    Col. dif. by NNLab
          NNLab  NNpig  KB                        (x10^-2)
------------------------------------------------------------------------------------------------------------------------------
GNFall      o      o    o      5058.7 (4759.5)     0.643  (0.213)     3.90 (3.88)      79.28% (79.28%)     0.267  (0.809)
GNFvoid     o      o    o      4134.4 (3082.8)    72.209 (36.165)     3.94 (4.89)      50.45% (21.62%)    48.800 (34.637)
GNFcp       o      o    x      4915.4 (4358.8)     1.190  (0.206)     4.02 (4.02)      78.38% (74.77%)     0.121  (0.659)
GNFck       o      x    o      4559.3 (4655.5)     1.695  (0.215)     3.88 (3.89)      74.77% (77.48%)     0.202  (0.656)
GNFc        o      x    x      4604.3 (4742.4)     2.802  (0.191)     5.35 (4.36)      28.83% (55.86%)     0.060  (0.567)


Table 22.9 shows the comparison of our proposed model, GNFall, and some
other approaches; NNnorm was a simple MLP approach, and NNmod was an
improved MLP model that had modified sigmoidal functions in the output layer [9].
Both had the same model size (16 x 18 x 21 x 10 neurons) (see Section 22.3).
GNFall, with all three components of the fitness function, employed the results
of three approaches (NNnorm, NNmod, and CANFIS) in producing the initial
population. According to the corresponding color difference predicted by NNLab,
only the result of GNFall was good enough to reach a satisfactory level of color
difference where human eyes could not tell the difference between presented colors.
(Again note that the desired color difference had to be smaller than about 1.0.)

Furthermore, to demonstrate the validity of each of the three components in
the fitness function, we tested GNFc, GNFcp, and GNFck. GNFc had NNLab
as the only component of the fitness function. GNFcp had both NNLab and
NNpig as two components of the fitness function. GNFck had the KB as well
as NNLab as two components. (GNFall had all three components.) Note that
NNLab played an important role as a color simulator, and so it always had to stay
in the fitness function. Table 22.10 shows how each component contributed to the
Figure 22.11. Complicated relationship of 32 color samples between actual color
difference and colorant errors; these 32 sample color paints were actually
manufactured.


prediction, and how they complemented each other. 

Additionally, to exhibit how indispensable CANFIS/NN results are at the initial
seeding stage, we examined GNFvoid, which had no multi-elites from CANFIS/NN
results, but had the same fitness function as GNFall. It started the GA search
from the randomly initialized points of colorant proportion vectors. In Table 22.10,
we see the potential capability of the five models; the values in parentheses signify
the best performance with respect to colorant errors, regardless of fitness; they were
obtained when colorant errors were minimized.

22.7 DISCUSSION 

Usually, it is difficult to construct an ideal “fine-tuner” by using the GA; we discuss 
a fusion technique and specific modifications applied to the fitness function and GA 
operations. The GA alone may not be a good fine-tuner, yet the complete resulting 
system can be viewed as a fine-tuner, overcoming individual limitations; the key 
idea is that components complement each other. In this context, the GA may not 
be the only choice to play the central role in conjunction with other components of 
soft computing. 

When NNLab acted alone as the fitness function, as in GNFc, the system
tended to go too far toward minimizing color difference, and therefore the average
number of required colorants was larger. In addition, the specified colorants did
not match well those designated by NNpig, as indicated by the low percentage of
accord with NNpig in Table 22.10. NNpig was supposed to learn the
characteristics of colorant compositions, such as complementary color relationships,
yet it did not perform perfectly (see again Table 22.7). Thus, the KB surely helps
the system evolve to recognize more colorant features. It must be emphasized that
the components of the fitness function worked synergistically.

Both NNpig and NNLab as components of the fitness function made some
prediction errors, as shown in Tables 22.7 and 22.8. Hence, improving the accuracy
of such neural fitness functions may enhance the overall performance. Even though
NNLab lacks high precision, the performance of GNFall is still better than those
of other approaches; this supports the view that the strategy of the search
direction depicted in Figure 22.9 is trustworthy.

GNFvoid had no multi-elites but had all three components of the fitness
function, starting from randomly initialized colorant concentration vectors. Its
poor performance underscores the importance of the multi-elites (i.e., mutant
copies from the CANFIS/NN approach results); without them, we cannot draw any
advantage from the search direction based on the idea of the downhill simplex
method, summarized in Figure 22.9. In other words, seedings from CANFIS or other
NN approaches are indispensable in enabling manufacturing intelligence to function
efficiently within a reasonable amount of computation time. Also, such
extraordinarily poor performance in predicted color difference implies that
Equation (22.4) in the fitness calculation of color difference may need to be
altered for such no-seeding cases.

As shown by the values in parentheses in Table 22.10, the system did not put
the highest fitness on the best chromosome in terms of colorant errors. In this
simulation, we did not use the elitist selection method because the fitness function
could not calculate colorant errors; even if a better child chromosome in terms of
colorant error appears, the elitist strategy may jeopardize its chance of advancing
to the next generation [9]. In practice, the system selects the chromosome with the
highest fitness found over a preset number of generations as the final solution.
The column "Ave. # of gen.," or average number of generations, in Table 22.10
indicates when the solution chromosome appeared during the evolutionary process.

The colorant errors in parentheses in Table 22.10 show almost the same error
level (0.2 x 10^-2), except for that of GNFvoid, although the colorant errors of the
final solutions of the system are different. Actually, only 71 patterns among the
111 checking patterns were improved in terms of colorant error. This may be partly
because the CANFIS models did a good job in prediction, so their results may be
hard to improve on, but partly also because the system may happen to find another
colorant composition solution. The presented manufacturing intelligence may find
another solution if the color simulator, NNLab, learns much of the mapping from
colorant compositions to perceptual attributes of color (L*, a*, and b*).

Figure 22.11 shows an interesting fact that the real perceived color difference 
did not exactly correspond to the magnitude of colorant errors; those data were 
collected by manufacturing 32 color paint samples. Such complicated relationships 
between colorant errors and actual color differences may imply that the mapping 
from surface spectral reflectance to a list of colorants may not be a one-to-one 
correspondence. [As stated in Table 22.1, we may need to take care of the (P3) and 
(P5) problems; different colorant compositions may produce the same or almost the 
same color to human perception.] 

One possible explanation of this finding can be seen in the following chromosome
examples. Suppose we need to produce a target color whose ideal colorant
composition includes White, Red1, and Yellow2, and two colorant proportion strings
(Candidate1 and Candidate2) exist in the population. Further suppose Candidate2
has a smaller color difference predicted by NNLab than Candidate1, but Candidate2
has a bigger colorant error due to its different colorant composition (White, Red2,
and Yellow1), as shown in the following table:

               White    Red1     Red2     Yellow1  Yellow2  Colorant   Color
                                                            error      difference
---------------------------------------------------------------------------------
Ideal vector   0.8101   0.0552   0        0        0.1347   0          0
Candidate1     0.7821   0.1631   0        0        0.0548   Smaller    Larger
Candidate2     0.7954   0        0.0719   0.1327   0        Larger     Smaller


In this case, Candidate2 most likely will have a higher fitness value than Candidate1
despite its colorant choice variation from the ideal. Therefore, the system may pick
Candidate2 as a solution. This is the trick. If the resulting manufactured color of
Candidate2 has a small enough color difference that human eyes cannot discern it,
Candidate2 will be acceptable. While MLP and CANFIS approaches cannot deal
with this "another solution problem" without suitable modifications, the
manufacturing intelligence can handle it. In other words, the resulting system can
function beyond the fine-tuner's reach. To draw more convincing conclusions, we must
explore further resultant colorant vectors by checking whether different combinations
of colorants really have the same perceptual attributes of color as seen by humans
[i.e., (P5) in Table 22.1].

This section concludes with a note on the accuracy of the color difference formula
defined in Equation (22.3): The adopted CIE 1976 space is not perfect.

In color science, it is still important to characterize the nature of human color vision. 


22.8 CONCLUDING REMARKS AND FUTURE DIRECTIONS 

In Section 22.4, we demonstrated the strength of a knowledge-embedded CANFIS. 
By constructing MFs in color attribute space, this neuro-fuzzy approach allows us to 
express and realize meaningful and concise representations of colorists’ knowledge. 
Of course, we must consider whether there is a more effective way to represent 
human visual sensitivity to color in the space of perceptual attributes than the use 
of bell-shaped MFs; we may need to contrive a more sophisticated MF. In structural 
terms of CANFIS, when we used CANFIS with 11 MFs in Figure 22.4, we had 45 
rules, and therefore it seems difficult to introduce neural rules or local color expert 
NNs. Imagine a huge CANFIS construction with 45 neural rules. Here it is
important to determine what rule formation is appropriate in terms of practical
feasibility. Confronting these concerns must be our next step, which may bring a
breakthrough in understanding neuro-fuzzy modeling. We believe such efforts
will pave the way for a new generation of CANFIS.

In Section 22.5, we presented manufacturing intelligence based on a unique blend
of principal components of soft computing, where a GA with a KB plays a leading
role in pursuit of predictions, linking an FS and NNs; they function complementarily
as a system rather than competitively. In light of both potential versatility and

Figure 22.12. A rough sketch of the paint industry.

practical validity, such hybrid systems can yield advantages over other individual
approaches in that they can evolve their results.

To focus on color difference, we endeavored to simulate the manufacturing cycle
of color paint, which involved color recipe prediction as one process. The
manufacturing intelligence has a mechanism for checking predicted perceptual color
difference with an embedded color simulator, NNLab. It realized a higher degree
of prediction precision by evolving the results of other approaches. This finding
supports the concept of manufacturing intelligence based on soft computing and
provides a small but potentially significant step for our future research.

We are not claiming a tremendous success in the small application example 
presented in this chapter. Look at the rough sketch of the paint industry in Fig- 
ure 22.12, where lots of new challenges can be sensed. The color recipe prediction 
application is actually a tiny cogwheel in the machine of the paint industry. In a 
practical industry, there must be huge numbers of applications for computational 
intelligence. 

Now let us step back and see our whole painted picture of soft computing
intelligence in this chapter; at present, it may look premature. Yet it is felt to be
growing steadily toward making a lasting impact on future technology. We believe
that such computational intelligence and technological wizardry must be a match
for any gauntlet the industrial world may throw down. Moreover, we hope that new
ideas emerging from these studies will eventually stimulate engineers and scientists
in ways we cannot now imagine, and that such computationally intelligent systems
will pass stringent tests with flying colors.
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Appendix A 


Hints to Selected Exercises 


Chapter 2 
15. and 16. $\mathrm{width}(A_\alpha) = 2a$ 

20. (a). Let 

$$y = (a^{-p} + b^{-p} - 1)^{-1/p}.$$ 

Then find $\lim_{p \to 0} \ln y$. 


Chapter 3 

5. Modify the MATLAB file complv.m. 

6. Modify the MATLAB file complv.m. 

11. Modify the MATLAB file implicat.m. 

Chapter 4 

9. Modify the MATLAB file sug2.m. 

10. Modify the MATLAB file sug2.m. 


Chapter 5 


3. Remember that $\theta^T A \theta = \theta^T B \theta$ if $B = (A + A^T)/2$. 
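
The identity in this hint can be checked numerically before it is used. The following is a minimal sketch in plain Python; the matrix `A`, the vector `theta`, and the helper names `quad_form` and `symmetric_part` are illustrative assumptions and are not taken from the exercise.

```python
def quad_form(M, v):
    """Return v^T M v for a square matrix M (list of rows) and a vector v."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))


def symmetric_part(M):
    """Return the symmetric part (M + M^T) / 2 of a square matrix M."""
    n = len(M)
    return [[(M[i][j] + M[j][i]) / 2 for j in range(n)] for i in range(n)]


# Arbitrary nonsymmetric matrix and vector (assumed example values).
A = [[1.0, 4.0], [-2.0, 3.0]]
theta = [0.5, -1.5]

B = symmetric_part(A)
qa = quad_form(A, theta)   # quadratic form with the original matrix
qb = quad_form(B, theta)   # quadratic form with its symmetric part
print(qa, qb)              # the two values agree
```

The antisymmetric part of `A` contributes nothing to the quadratic form, which is why only the symmetric part needs to be considered.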


8. Show that 

$$P_{k+1} a_{k+1} = \frac{P_k a_{k+1}}{1 + a_{k+1}^T P_k a_{k+1}}.$$ 


Chapter 6 

1. Show $E(\theta_{now} + \eta d) - E(\theta_{now}) < 0$. 

4. See Algorithm 6.2. 

Chapter 7 

3. • Vectorize the calculation of the distance matrix. 

• Calculate dE based only on the links that have been changed. 
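
Both suggestions can be sketched for a tour-based search. The example below is only an illustration in Python (the exercise itself concerns MATLAB code): it assumes a symmetric Euclidean distance matrix and an inversion move that reverses one segment of the tour. The city coordinates and the function names are hypothetical.

```python
import math


def distance_matrix(cities):
    """Pairwise Euclidean distances, computed once before the search starts."""
    return [[math.dist(p, q) for q in cities] for p in cities]


def tour_length(tour, D):
    """Length of the closed tour, summing every link (the expensive way)."""
    n = len(tour)
    return sum(D[tour[k]][tour[(k + 1) % n]] for k in range(n))


def inversion_delta(tour, D, i, j):
    """Change in length if tour[i+1..j] is reversed, for 0 <= i < j < n.

    Only two links change: (a, b) and (c, d) are replaced by (a, c) and
    (b, d). This relies on D being symmetric.
    """
    n = len(tour)
    a, b = tour[i], tour[i + 1]
    c, d = tour[j], tour[(j + 1) % n]
    return D[a][c] + D[b][d] - D[a][b] - D[c][d]


cities = [(0, 0), (3, 0), (3, 4), (0, 4), (1, 2)]   # hypothetical coordinates
D = distance_matrix(cities)
tour = [0, 2, 1, 3, 4]

i, j = 0, 2
dE = inversion_delta(tour, D, i, j)                  # cost of four lookups
new_tour = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
print(dE, tour_length(new_tour, D) - tour_length(tour, D))
```

The incremental value matches the difference of the two full tour lengths, but it needs four table lookups instead of a sum over all links.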

Chapter 9 

1. Consider how to represent the three boundary lines of area T: X = 0, Y = 0, 
and X + Y - 1 = 0. 

12. Try the MATLAB file rbfn.m available via FTP or WWW (see Preface). 

Chapter 12 

15. Modify the MATLAB file trn_4in.m. 

16. Modify the MATLAB file trn_4in.m. 


Chapter 13 

1. It may suffer from slow convergence due to an increase of adjustable parameters. 


Chapter 14 

10. Use the Schwarz inequality 

$$(A^2 + B^2)(C^2 + D^2) \ge (AC + BD)^2.$$ 

11. Modify go_cart.m and other related files. (These related files can be found by 
pattern-matching commands; for instance, “grep CART *.m” under UNIX.) 
Note that to change to local linear models, you need to reorganize the main 
book-keeping table CART_table. See cartmain.m for more details. 


Chapter 17 

1. For the input signal to be able to drive the plant from any initial state $x(k)$ to 
any final state $x(k + r)$ in $r$ steps, the controllability matrix 

$$W = [A^{r-1}B \;\; A^{r-2}B \;\; \cdots \;\; AB \;\; B]$$ 

must be of rank $n$. 
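
The rank condition can be tested directly for a small linear plant. The sketch below uses plain Python lists with $r = n$; the two example systems and the helper names `matmul`, `controllability_matrix`, and `rank` are assumptions made for illustration only.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def controllability_matrix(A, B, r):
    """Return W = [A^(r-1) B, ..., A B, B] as a list of rows."""
    blocks = [B]
    for _ in range(r - 1):
        blocks.append(matmul(A, blocks[-1]))   # next power of A applied to B
    blocks.reverse()                           # highest power first
    return [sum((blk[i] for blk in blocks), []) for i in range(len(A))]


def rank(M, tol=1e-12):
    """Rank by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    rnk = 0
    for c in range(cols):
        if rnk == rows:
            break
        pivot = max(range(rnk, rows), key=lambda i: abs(M[i][c]))
        if abs(M[pivot][c]) < tol:
            continue                           # no usable pivot in this column
        M[rnk], M[pivot] = M[pivot], M[rnk]
        for i in range(rnk + 1, rows):
            f = M[i][c] / M[rnk][c]
            for j in range(c, cols):
                M[i][j] -= f * M[rnk][j]
        rnk += 1
    return rnk


# Assumed example 1: the input reaches both states, so W has full rank n = 2.
W_ok = controllability_matrix([[0.0, 1.0], [0.0, 0.0]], [[0.0], [1.0]], 2)
# Assumed example 2: the second state is never influenced by the input.
W_bad = controllability_matrix([[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]], 2)
print(rank(W_ok), rank(W_bad))
```

When the rank falls below $n$, some direction of the state space cannot be reached by any input sequence of length $r$.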


Chapter 18 

2. The crossover must be executed at a point where either the left substrings or 
the right substrings have the same numbers of 1’s. 
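
The condition in this hint can be illustrated with a small sketch. It assumes, since the exercise statement is not repeated here, that the goal is for one-point crossover to keep the number of 1’s in each offspring unchanged. The parent strings and function names are hypothetical.

```python
def valid_cut_points(p1, p2):
    """Cut positions c (1 <= c < len) where p1[:c] and p2[:c] hold equally many 1's."""
    return [c for c in range(1, len(p1))
            if p1[:c].count("1") == p2[:c].count("1")]


def one_point_crossover(p1, p2, c):
    """Swap the right substrings of the two parents at cut position c."""
    return p1[:c] + p2[c:], p2[:c] + p1[c:]


# Hypothetical parents, each containing three 1's.
p1, p2 = "110100", "011010"
cuts = valid_cut_points(p1, p2)
children = [one_point_crossover(p1, p2, c) for c in cuts]
for c, (c1, c2) in zip(cuts, children):
    print(c, c1, c2, c1.count("1"), c2.count("1"))
```

At any other cut position, one child would gain 1’s and the other would lose the same number.
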
List of Internet Resources 


Due to the rapid advances of network technology, the Internet has become a major 
source for all kinds of information needs. This appendix lists information resources on 
neuro-fuzzy and soft computing that are accessible via the Internet. The information 
listed here is by no means complete; an up-to-date version is available from the 
book’s homepage at 

http://www.cs.nthu.edu.tw/~jang/soft.htm 

ENTRY POINTS 

The following URL addresses provide convenient entry points for the search of 
information on neuro-fuzzy and soft computing: 

This Book’s Homepage: http://www.cs.nthu.edu.tw/~jang/soft.htm 

Fuzzy Logic and Neurofuzzy Resources: 

http://www-isis.ecs.soton.ac.uk/research/nfinfo/fuzzy.html 

PEOPLE 

This is a list of “fuzzy” people in random order. 

Who’s Who in Fuzzy Community: ftp://ftp.abo.fi/pub/i21msr/who.txt 

Lotfi Zadeh: http://http.cs.berkeley.edu/csbrochure/faculty/zadeh.html 

Ronald R. Yager: http://www.Iona.edu/rry.htm 

George Klir: http://ssie.binghamton.edu/people/klir.html 

Jerry Mendel: http://sipi.usc.edu/faculty/mendel.html 

Bart Kosko: http://sipi.usc.edu/faculty/kosko.html 

James M. Keller: http://www.missouri.edu/~ecewww/keller.html 

Kevin M. Passino: http://eewww.eng.ohio-state.edu/~passino/ 

Martin Brown: http://www-isis.ecs.soton.ac.uk/people/m_brown.html 

John Yen: http://www.cs.tamu.edu/faculty/yen/ 

Reza Langari: http://ACS.TAMU.EDU/~meen/2375.html 

Hao Ying: http://www.cs.tamu.edu/research/CFL/people/ying.html 

Yung-Yaw Chen: http://ipmc.ee.ntu.edu.tw/~yychen/ 

Li-Xin Wang: http://www.ee.ust.hk/~eewang/ 

Michael Lee: http://HTTP.CS.Berkeley.EDU/~leem/ 

J.-S. Roger Jang: http://www.cs.nthu.edu.tw/~jang/ 

NEWSGROUPS 

comp.ai.fuzzy: Newsgroup about fuzzy logic and fuzzy set theory, 
comp.ai.neural-nets: Newsgroup about neural networks, 
comp.ai.genetic: Newsgroup about genetic algorithms, 
comp.soft-sys.matlab: Newsgroup about MATLAB and SIMULINK. 

SOFTWARE 

This is a partial list of commercial and public-domain software for fuzzy logic and 
neuro-fuzzy applications. A comprehensive list can be found at 

http://www-isis.ecs.soton.ac.uk/research/nfinfo/fzsware.html 

Fuzzy Logic Toolbox: General information at 
http://www.mathworks.com/fuzzytbx.html, User’s page at 
http://www.mathworks.com/fuzzyupdate.html 

NEFCLASS: A neuro-fuzzy classifier at 
http://www.cs.tu-bs.de/ibr/projects/nefcon/nefclass.htm 

Machine Learning Library in C++: http://www.sgi.com/Technology/mlc 

FIR/TDNN Toolbox: Toolbox for finite impulse response (FIR) and time-delay 
neural networks (TDNN); http://www.eeap.ogi.edu/~ericwan/fir.html 

NN-based System Identification Toolbox: System ID toolbox using neural 
networks; http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html 

GENESIS Neural Simulator: A neural networks simulation package at 
http://www.bbb.caltech.edu/GENESIS 

PDP++ Software: NN Simulation System in C++, at 
http://www.cs.cmu.edu/Web/Groups/CNBC/PDP++/PDP++.html 

DATA SETS 

UCI Machine Learning Repository: 
http://www.ics.uci.edu/~mlearn/MLSummary.html 

Time Series Repository: 
http://www.cs.Colorado.edu/~andreas/Time-Series/TSWelcome.html 

Face Detection Data Set: 
http://www.ius.cs.cmu.edu/IUS/dylan_usr0/har/faces/test/index.html 

DELVE Data Set: 
http://www.cs.utoronto.ca/neuron/delve/delve.html 

JOURNALS 

IEEE Transactions on Fuzzy Systems: 
http://www.ieee.org/pub_preview/fuzz_toc.html 

IEEE Transactions on Neural Networks: 
http://www.eeb.electronic.tue.nl/neural/contents/ieee_trans_on_nn.html 

Fuzzy Sets and Systems: 
http://www.elsevier.nl/catalogue/SAE/515/08410/08417/505545/505545.html 

International Journal of Approximate Reasoning: 
http://seraphim.csee.usf.edu/Nafips/ijar.html 

International Journal of Neural Systems: 
http://www.wspc.co.uk/wspc/Journals/ijns/ijns.html 

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems: 
http://www.wspc.co.uk/wspc/Journals/ijufks/ijufks.html/ 

RESEARCH GROUPS 

There are more than 100 research groups all over the world working on neuro-fuzzy 
and soft computing. A detailed list can be found at 

http://www-isis.ecs.soton.ac.uk/research/nfinfo/fzrgroup.html 

List of MATLAB Programs 


This is a list of major MATLAB programs used in this book. These MATLAB programs 
may invoke other auxiliary MATLAB programs not listed here but available 
via the same FTP or WWW site mentioned on page xxiii. Files labeled with stars 
rely on functions in the Fuzzy Logic Toolbox, which are not freely 
available. 

activati.m: page 234 
allbells.m: page 27 
bfgs.m: page 155 
bellmanu.m: pages 27 and 46 
bjpick2.m*: page 515 
bjtrain.m*: page 516 
carterr.m: page 416 
cartmf.m: page 418 
cg.m: page 155 
compball.m: page 303 
complv.m: page 58 
convexmf.m: page 20 
cri.m: page 65 
cyl_ext.m: page 31 
descent.m: page 138 
disp_sig.m: page 29 
equdata.m: page 521 
equdec.m: page 523 
equdensi.m: page 521 
extensio.m: page 50 
fcmdemo.m*: page 428 
ftsurf1.m: page 419 
ftsurf2.m: page 419 
fuzimp.m: pages 62 and 63 
fuzpcr.m*: page 508 
fuzzy.m*: page 80 
gdconv.m: page 167 
gdss1.m: page 157 
gdss2.m: page 159 
go_cart.m: pages 413 and 414 
go_ga.m: pages 180 and 180 
go_rand.m: page 189 
randsrch.m: pages 188 and 195 
go_simp.m: page 193 
hem.m: page 155 
impurity.m: page 409 
init_mf.m: page 345 
intensif.m: page 59 
invkine.m*: page 510 
invsurf.m: page 509 
inv_fc.m*: pages 464 and 465 
inv_sig.m: page 464 
kfm.m: pages 307, 307, and 308 
lingmf.m: page 17 
lv.m: page 55 
lvqdata.m: page 309 
mam1.m: page 78 
mam2.m: page 79 
max_star.m: page 53 
mf2d.m: page 33 
mf_univ.m: page 15 
mount1.m: page 430 
mount2.m: page 430 
mpgdata.m: page 513 
mpgpick2.m*: page 512 
mpgtrain.m*: pages 513 and 512 
negation.m: page 36 
noise1.m*: pages 526, 527, 528, 529, and 530 
noise2.m*: pages 531, 532, 532, and 533 
nonoise.m: page 518 
optideci.m: page 520 
pchar.m: page 506 
project.m: page 32 
rbfn.m: page 253 
resolut.m: page 43 
siganim.m: page 28 
splits.m: page 412 
spring.m: page 109 
sstnorm.m: page 41 
subset.m: page 23 
sug1.m: page 83 
sug2.m: page 84 
tanxnlp.m: page 252 
taylor.m: pages 104 and 108 
tconorm.m: page 39 
tnorm.m: page 38 
eqtrain.m*: pages 522 and 523 
transform.m: page 125 
trn_1in.m*: pages 353, 354, and 355 
trn_2in.m*: pages 348 and 349 
trn_3in.m*: pages 350 and 350 
trn_4in.m*: pages 356, 357, and 357 
tsp.m: page 186 
tsu1.m: page 86 
tsurf1.m: page 406 
tsurf2.m: page 406 
xor2dmf.m: page 505 
xordata.m: page 230 
xormf.m: page 504 
xorsurf.m*: page 506 
xorsurf1.m: page 231 

Appendix D 


List of Acronyms 


This is a list of acronyms used in this book: 

ACE: Adaptive Critic Element 
AEN: Action Evaluation Network 
AHC: Adaptive Heuristic Critic 
AHCON: AHC Connectionist 
AI: Artificial Intelligence 

ANFIS: Adaptive Neuro-Fuzzy Inference System 

APE: Average Percentage Error 

AR: Auto-Regressive 

ART: Adaptive Resonance Theory 

ASE: Associative Search Element 

ASN: Action Selection Network 

BOA: Bisector Of Area 

CANFIS: CoActive Neuro-Fuzzy Inference System 

CART: Classification And Regression Tree 

CFR: Calculus of Fuzzy Rules 

CMAC: Cerebellar Model Arithmetic Computer 

COA: Centroid Of Area 

CPP: Cart and Parallel Poles 

CQ learning: Compositive Q learning 

CRBP: Complementary Reinforcement BackPropagation 

DP: Dynamic Programming 

ECG: ElectroCardioGram 

EECS: Electrical Engineering and Computer Science 

EM learning: Expectation-Maximization learning 

ERL: Evolutionary Reinforcement Learning 

ES: Expert System 

FAM: Fuzzy Associative Memory 

FAQ: Frequently Asked Question 

FC: Fuzzy Control or Fuzzy Controller 

FIS: Fuzzy Inference System 

FL: Fuzzy Logic 

FLC: Fuzzy Logic Controller 

FS: Fuzzy System 

GA: Genetic Algorithms 

GARIC: Generalized Approximate Reasoning for Intelligent Control 

GDR: Generalized Delta Rule 

GMDH: Group Method of Data Handling 

GMP: Generalized Modus Ponens 

GRNN: General Regression Neural Network 

IEOR: Industrial Engineering and Operations Research 

KB: Knowledge Base 

LMS: Least Mean Squares 

LRF: Localized Receptive Fields 

LS: Least-Squares 

LSE: Least-Squares Estimator 

LVQ: Learning Vector Quantization 

MANFIS: Multiple ANFIS 

MATLAB: MATrix LABoratory 

MENACE: Matchbox Educable Naughts And Crosses Engine 

MF: Membership Function 

MIQ: Machine Intelligence Quotient 

MLP: MultiLayer Perceptron 

MOM: Mean Of Maxima 

MPG: Miles Per Gallon 

MRAC: Model Reference Adaptive Control 

MRH: Multi-Resolution Hierarchies 

NC: NeuroComputing 

NDEI: Non-Dimensional Error Index 

NP: Nondeterministic Polynomial 

OCR: Optical Character Recognition 

PC: Personal Computer 

PCA: Principal Component Analysis 

PCR: Printed Character Recognition 

PR: Pattern Recognition or Probabilistic Reasoning 

RBFN: Radial Basis Function Network 

RL: Reinforcement Learning 

RMSE: Root-Mean-Squared Error 

RNN-FLCS: Reinforcement Neural Network-based Fuzzy Logic Control System 

RTRL: Real-Time Recurrent Learning 

SA: Simulated Annealing 

SAM: Stochastic Action Modifier 

TLS: Total Least Squares 

TLU: Threshold Logic Unit 

TSP: Traveling Salesperson Problem 

VFSR: Very Fast Simulated Reannealing 

VLSI: Very Large Scale Integrated circuits 

XOR: Exclusive OR 


Index 


activation function, 228 

hyperbolic function, 233 
identity function, 233 
logistic function, 233 
sigmoidal function, 204, 233 
signum function, 228 
squashing function, 233 
step function, 228 
Adaline, 230 

adaptation gain vector, 114 
adaptive heuristic critic, 273 
action NN, 274 
actor-critic, 273 
value NN, 274 

adaptive linear element (see Adaline), 230 

adaptive mixtures (see modular networks), 246 
adaptive network, 200 

adaptation algorithms, 203 
adaptive node, 201 
error measure, 203 
error signal, 207 
feedforward, 201 
fixed node, 201 
hybrid learning 
backward pass, 221 
forward pass, 221 
layered representation, 202 
learning rules, 203 
node function, 200 
node output, 200 


parameter node, 201 
parameter nodes, 212 
parameter sharing problem, 201 
recurrent, 201 

topological ordering representation, 
202 

adaptive network-based fuzzy inference system (see ANFIS), 335 
adaptive neuro-fuzzy inference system (see ANFIS), 335 

adaptive resonance theory (ART), 305 
AHC (see adaptive heuristic critic), 
273 

ANFIS, 335, 369 
Applications, 503 
consequent parameters, 337 
hybrid learning, 340, 387 
normalized firing strengths, 337 
premise parameters, 337 
approximation problem, 490 
average relative variance, 357 

backpropagation, 158, 205, 233, 235 
epoch, 210 
epoch size, 210 
learning rate, 209 
step size, 209 
sweep, 210 

backpropagation through time, 213, 
470 

banana function, 195 
batch learning, 210 
batch learning (see off-line learning), 236 

best linear unbiased estimator, 119 
bias term, 187 
binary fuzzy relation, 51 
bisection method, 143 
block form, 98 

BLUE (see best linear unbiased estimator), 119 

Boltzmann machine, 327 
Boltzmann machines, 183 
Boltzmann probability distribution, 182 
BP (see backpropagation), 235 
BPTT (see backpropagation through time), 213 

bucket brigade algorithm, 292 

C-means clustering, 424 
C4, 407 

CANFIS, 369, 572 
CART 

complexity parameter, 414 
impurity function, 407 
local model, 412 
splits, 407 

CART (see classification and regression trees), 407 
case-based reasoning, 503 
categorical variables, 410 
Cauchy machine, 184 
characteristic function, 14 
checking data set, 97 
classical fuzzy operators, 24 
classification and regression trees, 407 
classification problem, 390 
classification trees, 404 
classifier systems, 292 
coactive neuro-fuzzy inference system (see CANFIS), 369 
color recipe prediction, 568, 569 
committees of networks (see modular networks), 246 
competitive learning, 302 
composite linguistic term, 56 


composition 

max-min, 52 
max-product, 52 
concentration, 56 

conjugate gradient methods, 129, 148, 
152 

Beale-Sorenson’s formula, 152 
conjugacy, 148 
coordinate directions, 149 
Fletcher-Reeves’s formula, 153 
orthogonal descent searches, 149 
orthogonality, 148 
Polak-Ribiere’s formula, 153 
restart algorithm, 153 
connectives, 55 
consequent 

linear rule, 371, 574 
neural rule, 376, 574 
nonlinear, 376 
sigmoidal rule, 376, 574 
contrast intensification, 57 
coordinate descent searches, 149 
cost-complexity measure, 414 
covariance matrix, 313, 519 
credit assignment, 261 
structural, 262 
temporal, 262 

critic (see adaptive heuristic critic), 264, 273 

cubic interpolation, 146 
curse of dimensionality, 87 
cybernetics, 5 
cylindrical extension, 30 
base set, 31 
projections, 31 

data density, 520 
data scaling, 237 

input scaling, 237 
output scaling, 237, 381 
decision tree, 404 

external node, 404 
internal node, 404 
leaf, 404 
terminal node, 404 
defuzzification, 73 
delta rule, 231 
descent methods, 129 

descent direction condition, 131 
gradient-based descent methods, 
130 

design matrix, 104 
dilation, 56 

dimensionality reduction, 314 
distal teacher, 288 

divide-and-conquer methodology, 247, 
290 

domain knowledge, 88 
downhill simplex search, 189 
contraction, 191 
expansion, 191 
shrinkage, 192 
simplex, 189 

DP (see dynamic programming), 270 
dynamic programming, 270 
approximate DP, 272 
incremental DP, 272 
dynamic skeletonization, 446 

entropy function, 408 
epoch, 236 
ε-completeness, 346 
error vector, 105 
error-propagation network, 210 
evaluation functions, 263, 264, 270, 
273, 274, 279 
Manhattan Distance, 263 
exclusive-OR (XOR) problem, 229 
expectation-maximization, 250 
extended Kalman filter algorithm, 223 
extension principle, 47 

fast simulated annealing, 184 
feature maps, 305 
feature selection, 312, 435 
feedback control system, 454 
feedback linearizable, 470, 494 
feedback-linearizable system, 468 


focus set, 447 
focus window, 447 
forgetting factor, 116, 222 
fuzzy associative memory, 73 
fuzzy boxtree, 443 
fuzzy C-means clustering, 425 
fuzzy channels, 536 
fuzzy expert system, 73, 459 
fuzzy filtering, 536 
fuzzy inference system, 73 
aggregate operator, 79 
AND operator, 79 
database, 73 

defuzzification operator, 79 
dictionary, 73 
implication operator, 79 
OR operator, 79 
reasoning mechanism, 73 
rule base, 73 
fuzzy ISODATA, 425 
fuzzy k-d trees, 438 
fuzzy logic control, 453 
fuzzy logic controller, 73, 453 
fuzzy model, 73 
fuzzy modeling, 88 
deep structure, 89 
surface structure, 88 
fuzzy points, 441 
fuzzy relation 

relation matrix, 51 
fuzzy set, 13, 14 
α-cut, 18 
α-level set, 18 
bandwidth, 20 
Cartesian co-product, 24 
Cartesian product, 24 
closed, 21 
complement, 22 
Sugeno’s, 35 
Yager’s, 36 
contained, 21 
convex, 19 
core, 18 

crossover point, 18 
DeMorgan’s law, 40 
intersection, 22 
nonrandomness, 16 
normal, 18 
open left, 21 
open right, 21 
resolution principle, 42 
strong α-cut, 18 
strong α-level set, 18 
subjectivity, 16 
subset, 21 
support, 17 
symmetric, 20 
union, 21 
universe, 14 

universe of discourse, 14 
width, 20 
fuzzy singleton, 18 
fuzzy system, 73 

fuzzy-filtered neural network, 535 
fuzzy-rule-based system, 73 

GA (see genetic algorithm), 175 
gain-scheduling fuzzy controller, 489 
Gauss-Markov conditions, 118 
Gauss-Markov theorem, 119 
Gauss-Newton method, 129, 161, 223 
damped Gauss-Newton method, 163 

Hartley’s method, 163 
Gauss-Seidel iteration, 149 
Gauss-Southwell method, 149 
Gaussian elimination procedure, 151 
Gaussian function, 374 
Gaussian-bar (hidden) unit, 374 
general learning, 460 
generalization capability, 109 
generalized delta rule, 235 
generalized inverse, 106 
genetic algorithm, 175, 568 
chromosome, 175, 484 
crossover, 176 
crossover rate, 176 
elitism, 178 


fitness function, 582 
fitness value, 175 
gene, 176, 484 
gene pool, 175 
generation, 175 
genetic operators, 175 
Gray coding, 176 
mutation, 177, 586 
mutation rate, 177 
one-point crossover, 176 
population, 175 
simplex crossover, 585 
survival of the fittest, 176 
two-point crossover, 176 
Gini diversity index, 408 
golden section search, 144 
Goldstein test, 147 

graded learning (see reinforcement learning), 258 
gradient, 100, 130 

deflected gradients, 132 
gradient method (see steepest descent method), 133 

Gram-Schmidt orthogonalization, 150 
grid partition, 86, 374, 572 
Guillotine cuts, 438 

hat operator, 112 
Hebbian learning, 310 
hedges, 55 

Hessian (see Hessian matrix), 100 
Hessian matrix, 100, 135, 161 
hierarchical mixtures of experts (see modular networks), 246 
Hopfield network, 316 

asynchronous updating, 320 
attractors, 317 
basin of attraction, 317 
continuous updating, 321 
energy function, 316 
wells, 317 

hybrid learning, 123 
hyperbolic paraboloid, 138 

ID3, 407 

implicit weight normalization, 419 
importance measure, 435 
information entropy, 362 
input selection, 435, 510 
interpolation problem, 490 
intrinsically linear, 122 
intrinsically nonlinear, 122 
inverse learning, 460 
involution, 35 

Jacobian (see Jacobian matrix), 100 
Jacobian matrix, 100, 161, 467 

K-means clustering, 424 
Karhunen-Loeve transformation, 312 
knapsack problem, 329 
knowledge acquisition, 5, 434, 459 
Kohonen feature maps 

neighborhood function, 306 

leaky learning, 304 
learning rate rescaling, 238 
learning vector quantization, 308 
least mean square learning (see LMS learning), 232 

least-squares estimator, 106 
normal equation, 106 
orthogonal operator, 112 
principle of orthogonality, 111 
projection operator, 112 
least-squares polynomial, 108 
Levenberg-Marquardt method, 129, 137, 
163, 223 

likelihood function, 120 
line minimization (see line search), 130 
line search, 141 
bracketing, 141 
linear classifier, 203 
linear regression, 104 
linear transversal equalizer, 517 
linearization method (see Gauss-Newton method), 161 
linearly separable, 229 


linguistic information, 459 
linguistic term, 54 
linguistic value, 54 
linguistic variable, 54 
orthogonal, 58 
primary terms, 55 
LMS learning, 223, 232, 328 
reverse LMS, 316 

LSE (see least-squares estimator), 106 

machine intelligence, 5 
Madaline, 232 

Mamdani fuzzy inference system, 74 
MANFIS, 370, 509 
manufacturing intelligence, 579 
Mason’s gain formula, 217 
matrix inversion formula, 103 
max-min product, 52 
maximum likelihood estimator, 120 
membership function, 14, 372 
bell, 26 
Cauchy, 27 
extended S, 418 
Gaussian, 26 
generalized bell, 26, 374 
L-R, 29 
left-right, 29 
modified bell, 376, 575 
π, 44 
S, 43 

sigmoidal, 28 
trapezoidal, 25 
triangular, 25, 374 
two-sided π, 44 

two-sided Gaussian, 44, 376, 388 
Z, 44 

membership matrix, 424 
MF (see membership function), 14 
minimal disturbance principle, 359 
minimum cost-complexity, 414 
minimum phase, 518 
MLE (see maximum likelihood estimator), 120 
MLP, 569 
MLP (see multilayer perceptron), 233, 370 

model reference adaptive control, 467 
model-based learning, 288 
model-free learning, 288 
modified Hebbian learning rule, 316 
modified Newton’s method, 136 
modified sigmoidal function, 380, 390, 
modular networks, 246, 290, 372, 378, 574 

expert networks, 246 
gating network, 246 
integrating unit, 246 
local experts, 246, 574 
momentum, 158, 236 
mountain clustering method, 427 
mountain function, 427 
MRAC (see model reference adaptive control), 467 

multilayer perceptron, 233, 244 

NDEI (see non-dimensional error index), 356 

nearest-neighbor classification, 503 
negation, 55 
net input, 234 
network error, 462 
neural networks, 226 
neuro-fuzzy and soft computing, 1 
neuro-fuzzy computing, 1 
neuro-fuzzy control, 454, 458 
neuro-fuzzy spectrum, 382 

appropriate linguistic meanings, 572 

dilemma between interpretability 
and precision, 382 
dilemma between precision and 
interpretability, 387 
fuzzy partitionings, 572 
ill-defined rules, 573 
interpretability spectrum, 386 
neuron functions (see activation function), 372 


Newton’s method, 129, 135, 142 
Newton direction, 135 
Newton step, 135 
NN (see neural networks), 226 
non-dimensional error index, 356 
nonlinear least-squares, 122, 129, 160 
nonminimum phase, 518 
NP-complete, 184 
numerical data, 88 
numerical information, 459 
numerical variables, 410 

off-line learning, 210, 236, 304 
on-line learning, 210, 236 
ordered derivative, 207 
ordered variables, 410 
overfitting, 535 
overlearning, 535 

parameter identification, 362, 403, 434 
parameter sharing, 201 
parameterized T-norms, 40 
pattern-by-pattern learning (see on-line learning), 210, 232, 236, 395 
perceptron, 204, 227 
threshold, 204 
weight, 204 

polynomial interpolation, 244 
population-based optimization, 175 
positive definite, 102 
principal component analysis, 312 
principle of incompatibility, 54 
principle of orthogonality, 314 
prototypes, 504 
pseudoinverse, 220 

Q-learning, 278 

quadratic form, 101, 134, 148, 154 
quadratic interpolation, 145 
quasi-Newton methods, 139 

BFGS (or Broyden-Fletcher-Goldfarb-Shanno) formula, 140 
DFP (or Davidon-Fletcher-Powell) formula, 140 

hereditary positive definiteness, 140 

radial basis function network, 238, 241, 
370, 372 

approximation RBFN, 243 
interpolation RBFN, 242 
random search, 186 
reverse step, 187 

RBFN (see radial basis function network), 238 

real-time recurrent learning, 213, 470 
receptive field unit, 372 
recording learning, 301 
recurrent backpropagation 
continuous operation, 212 
synchronous operation, 212 
recurrent error-propagation network, 
217 

recurrent networks, 210, 290 
recursive least-squares estimator, 114 
recursive least-squares identification, 
113 

recursive partitioning, 406 
regression coefficients, 104 
regression function, 104 
regression trees, 404 
regulator problem, 455 
reinforcement learning, 258, 480 
agent, 258 

delayed reinforcement, 262 
immediate reinforcement, 262 
immediate reward/penalty, 279 
MENACE, 261 
policy, 259 

reinforcement comparison, 263 
reward-penalty scheme, 259 
residual, 412 
Reversi, 558 

RMSE (see root-mean-squared error), 
346 

root-mean-squared error, 346 


RTRL (see real-time recurrent learning), 213 

rulebase compression, 446 

S-norm (see also T-conorm), 38 
SA (see simulated annealing), 181 
saddle point, 133, 137, 237 
SC (see soft computing), 1 
scale invariant, 136 
scaling, 136, 165 
scatter partition, 87, 374 
secant method, 143 
self-organizing networks, 305 
semantic rule, 54 

semi-local activation (hidden) unit, 374 
sensitivity model, 210 
signum function, 517 
similarity measure, 304, 504 
simulated annealing, 181, 326 
annealing schedule, 181 
cooling schedule, 181 
move set, 183 
sine function, 335 
sliding mode control, 495 
soft computing, 1, 7, 568 
softmax activation function, 248, 380 
specialized learning, 465 
squashing function (see activation function), 204 

stability-plasticity dilemma, 305 
stage adaptive network, 471 
standard fuzzy operators, 24 
state equations, 454 
stationary point, 133 
steepest descent method, 129, 133 
hemstitching, 156 
step size, 130 
step-halving, 137 
Stone-Weierstrass theorem, 342 
structure identification, 403, 434 
subtractive clustering, 431 
Sugeno fuzzy model, 81 
first-order, 82 
zero-order, 82 
sum of squared error, 105 
sum-product composition, 80 
supervised learning, 226, 259, 301 
syntactic rule, 54 
system error, 462 
system identification, 95 
target system, 95 

T-conorm, 38 

algebraic sum, 39 
bounded sum, 39 
drastic sum, 39 
maximum, 39 
T-norm, 37 

algebraic product, 37 
bounded product, 37 
drastic product, 37 
minimum, 37 

task decomposition, 247, 291, 378 
Taylor series expansion, 103, 131, 135, 
161 

TD (see also temporal difference), 264 
temporal difference, 264 
TD(0), 265 
TD(1), 265 
TD(λ), 265 
term set, 54 
test data set, 97, 108 
topology-preserving maps, 305 
tracking, 212 
tracking problem, 455 
training data set, 104, 202 
trajectory adaptive network, 469 
trajectory following, 212 
transfer function (see activation function), 233 

traveling salesperson problem, 183, 324 
tree partition, 87 
triangular norm (see T-norm), 36 
truncation filter function, 380, 390, 571 
trust-region method, 138 
TSK fuzzy model (see also Sugeno fuzzy 
model), 81 


TSP (see traveling salesperson problem), 183 

Tsukamoto fuzzy models, 84 
twoing rule, 411 

unbiased estimator, 118 
unfolding of time, 212, 469 
unreachable workspace, 508 
unsupervised learning, 301 

validating data set, 97, 108 
variable metric methods (see quasi-Newton methods), 139 
vector quantization, 305 
very fast simulated reannealing, 184 
reannealing, 186 
temperature rescaling, 186 
VFSR (see very fast simulated reannealing), 184 

weakest-subtree shrinking, 414 
weight of importance, 435 
weighted average, 82, 239 
weighted least-squares estimator, 107 
weighted norm, 374 
weighted sum, 82, 239 
Widrow-Hoff learning (see LMS learning), 232, 266 

winner-take-all learning rule, 303 
Wolfe test, 147 



